Results 1 to 12 of 12

Thread: ZPAQ 1.05 preview

  1. #1
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts

    ZPAQ 1.05 preview

    Attached is a preview of the spec for zpaq 1.05. Code still to be written. The most important part is a guide that actually helps you write configuration files to develop new compression algorithms. It explains how to write ZPAQL code, how to assemble models, what the components and their parameters do, and suggests practical approaches. You might find it useful

    In 1.05 I am actually taking out more code than I'm putting in. Highlights:

    - Preprocessors will be external programs written in any language. ZPAQ will verify that the ZPAQL postprocessing code inverts the transform correctly before compressing by comparing input and output SHA1 checksums after passing through both programs. This is much faster than testing by compressing, decompressing, and comparing. It will be automatic during compression.

    - Removed POST x (E8E9) and POST p (LZP) as these will be external programs instead of built in. class Preprocessor will go away. Kept POST 0 (no postprocessing) for compatibility with older config files. If you want pre/postprocessing, then replace POST in the config file with PCOMP, the external command to do the preprocessing transform, and ZPAQL code to invert it.

    - Some of the lesser used commands have changed to make them more organized. The common commands c (create archive), a (append), x (extract) and l (list) are unchanged, but they take different modifiers. The modifiers allow you to store the full path name or no file name at all, omit the comment, omit the checksum, or be verbose.

    - Removed file segment compression (k command). This isn't really useful, and if you really want to compress a file in parts with different algorithms then you can take it apart externally and compress all but the first piece with no filename so the decompresser will reassemble them by default.

    - Consolidated the debugging commands under one command, r (run). "zpaq rF output input" means run the HCOMP program in config file F with specified output and input files. "zpaq vptrF 1 2 3" means trace (t) the PCOMP code (p) with inputs 1, 2, 3, and compile verbosely (v) outputting a program listing (to verify jump targets and check errors) and C initialization lists for writing internal preprocessors. ZPAQ is really a development environment. Real compressors won't have config files.

    - New extract commands: pnxN. p means don't postprocess (for debugging), n means don't reassemble unnamed segments (for debugging) and N means extract only block N (starting with 1).

    I did not add the new extract commands to unzpaq or zpaqsfx. These are supposed to be small and simple. If you really want to do these things, you can use zpaq instead.
    Attached Files Attached Files

  2. #2
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,511
    Thanks
    746
    Thanked 668 Times in 361 Posts
    i don't understand what you mean by external programs? does this mean that archive may contain arbitrary executable? it's inappropriate because it may be troyan. we need some sort of VM that sandbox code executed from archive itself

  3. #3
    Member
    Join Date
    Oct 2007
    Location
    Germany, Hamburg
    Posts
    408
    Thanks
    0
    Thanked 5 Times in 5 Posts
    I guess it means external program for creation and included script for decompression as it was possible since the beginning.
    The only big disadvantage is the speed of the decompressor.

  4. #4
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    What I mean is that if you want to develop a compression algorithm using a transform (say, a dictionary for text, or a 2-D delta filter for images), then you have to write the transform and its inverse. The inverse has to be in ZPAQL so that it can be put in the archive and executed by the decompressor. However there is no reason to include the forward transform code in the archive. You can write that part in any language you want and run it during compression.

    What ZPAQ 1.05 will support is verifying that the forward and reverse transforms are true inverses of each other. During compression it will pass the input files through both programs (without compressing) and verify the data is restored exactly. Only then will it compress the transformed data and put the inverse transform ZPAQL code in the archive.

    ZPAQ is not just a compressor. It is a development environment for creating new compression algorithms. If you wanted to write, say, a BWT based compressor using ZPAQ, then you would write a program to do a forward BWT transform and write a config file with an inverse BWT transform in ZPAQL.

    After you test your code, you could write a production program by moving your forward BWT code into your compressor and use ZPAQ to translate your config file into a pair of array initializations. The production compressor won't use config files. It will just write these arrays into the archive headers. It will take a simple command like:

    zpaq_bwt archive files...

    or

    zpaq_bwt input output

    and then you would use unzpaq to uncompress. Or you could include a copy of the unzpaq code in your program for convenience.

    As an optimization you could do the inverse transform in C++ rather than ZPAQL whenever you recognize specific code in the header. In my tests interpreted ZPAQL is 8 to 20 times slower than optimized C++. However, in a typical compressor, most of the CPU time is spend in the context model instead of executing ZPAQL because the model is run for every bit but the ZPAQL is only run once per byte.

    As an extreme example I wrote a custom (non public) ZPAQ compressor for 16 bit color TIFF images for my company. I used a 10 tap predictive filter followed by a color transform followed by a simple order 0 model (a CM). paq8px_v64 compressed a 12 MB image to 7 MB in 35 minutes. My program compressed the same image to 4.6 MB in 8 seconds and decompressed in 19 seconds. The difference is the transform, which takes 11 seconds in ZPAQL vs 0.5 seconds in C++.

    But like I said this is an extreme example because there was a lot of ZPAQL code and very little modeling. The 8x slowdown I measured earlier was for a CAP transform ('A' -> CAP,'a') which had very little impact on total compression speed because the model had several components.

  5. #5
    Member
    Join Date
    Oct 2007
    Location
    Germany, Hamburg
    Posts
    408
    Thanks
    0
    Thanked 5 Times in 5 Posts
    You are right most times preprocessions only have a small percentage in the whole progress. But a possible solution and an interesting programming part would be an often mentioned compiling of the script code by the decoder. Are you still not planning to do this?

  6. #6
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Not for awhile. I think it is more important to develop new compression algorithms first. Optimizing ZPAQL is a hard problem. It might be easier to design a faster language, one with longer instructions.

  7. #7
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    I am considering making ZPAQL a structured language. Right now you have to count instructions to code jump targets. It would be nice to write if-else-endif or do-while loops instead. For example if you want something like:

    if (a==b) {...}

    you have to write:

    a==b JF 10 (skip 10 instructions...)

    It would be nice to write instead:

    a==b IF ... ENDIF (actually not case sensitive)

    and let the compiler do the counting. Other constructs:

    IF ... ELSE ... ENDIF
    IFNOT ... ENDIF
    IFNOT ... ELSE ... ENDIF
    DO ... WHILE
    DO ... UNTIL
    DO ... FOREVER

    Comparisons (<, ==, >) come before the IF, IFNOT, WHILE, UNTIL. The negated forms (IFNOT, UNTIL) are needed because there are no (<=, !=, >=) operators in ZPAQL.

    IF and DO nest separately. Thus DO ... IF ... WHILE ... ENDIF will break out of a loop. For more complicated spaghetti code you can always use old style jumps.

    Compiling is simple. There are 2 stacks, one for forward jumps (IF, IFNOT) and one for backward jumps (DO).

    IF = code as JF ? and push address of ? on IF stack to fill in later.
    IFNOT = code as JT ? and push address.
    ELSE = pop, fill in JF/JT target, code JMP ? and push address.
    ENDIF= pop, fill in JF/JT/JMP target.
    DO = push address on DO stack.
    WHILE = code as JT and pop to get target.
    UNTIL = code as JF and pop target.
    FOREVER = code as JMP and pop target.

    JT, JF, JMP have 1 signed byte targets relative to the next instruction in the range -128 to 127. Longer jumps can be coded using LJ with a 2 byte absolute address. For coding loops the choice could be made as needed. For example, "JT 3 LJ xx" instead of "JF x". To make the compiler 1 pass, forward long jumps would need to be coded using 5 byte long forms IFL, IFNOTL, ELSEL.

  8. #8
    Member Nightgunner5's Avatar
    Join Date
    Sep 2009
    Location
    Wisconsin, USA
    Posts
    4
    Thanks
    0
    Thanked 0 Times in 0 Posts
    With the example for encryption:
    Code:
    COMP 0 0 0 0 1
    	0 cm 9 255
    HCOMP
    	halt
    PCOMP caesar 5 $zpaq.pre
    	a> 255 jf 1 halt (ignore EOS)
    	a-= 5 out halt (subtract 5 from each byte)
    END
    I'm guessing that "caesar" will be called with the arguments "5", a temp file for output, and the input file. However, there's no way to have a user-specified encryption "password", so the compressed file's encryption would be pointless - It would decrypt itself and have the password included in it in semi-plaintext.

    Plus, this format forces the input file to be the last argument.

    If I were to make some kind of XOR encryption in ZPAQL (which wouldn't be all that hard now that I think about it), the password would be included in the config file as well as the archive!

    Why not something like:

    Code:
    PCOMP caesar $password $zpaq.pre $zpaq.input
    	a> 255 jf 1 halt (ignore EOS)
    	a-= $password out halt (subtract $password from each byte)
    END
    and then have some flag that tells ZPAQ to ask for a password when this specific config file is used?

  9. #9
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Actually the point is not to encrypt but to illustrate how to use a transform. A caesar cipher is about equally secure whether you give away the key or not

    Also, I am changing the interface so that the transform takes an input file and output file (in that order) as the last 2 arguments. This allows the temporary files to be derived from the archive filename so it is safe to run 2 copies of the compressor in the same directory (to different archives) and the temporary files won't clobber each other.

    A few other minor changes: px command changed to mean extract with paths. h option to r command displays trace in hexadecimal, and arguments to r are reversed (now input, output). And BTW I implemented structured ZPAQL as described and it works. Command summary:

    Code:
    ZPAQ v1.05 archiver, (C) 2009, Ocarina Networks Inc.
    Written by Matt Mahoney, Sep 28 2009.
    This is free software under GPL v3, http://www.gnu.org/copyleft/gpl.html
    
    To compress to new archive: zpaq [pnsiv]c[F] archive files...
    To append to archive:       zpaq [pnsiv]a[F] archive files...
    Optional modifiers:
      p = store filename paths in archive
      n = don't store filenames (extractor will append to last named file)
      s = don't store SHA1 checksums (saves 20 bytes)
      i = don't store file sizes as comments (saves a few bytes)
      v = verbose
      F = use options in configuration file F (min.cfg, max.cfg)
    To list contents: zpaq [v]l archive
      v = verbose
    To extract: zpaq [pnt]x[N] archive [files...]
      p = extract to stored paths instead of current directory
      n = extract unnamed segments as separate files (for debugging)
      t = don't post-process (for debugging)
      N = extract only block N (1, 2, 3...)
      files... = rename extracted files (clobbers)
          otherwise use stored names (does not clobber)
    To debug configuration file F: zpaq [pvt]rF [args...]
      p = run PCOMP (default is to run HCOMP)
      v = verbose compile and show initialization lists
      t = trace (single step), args are numeric inputs
          otherwise args are input, output (default stdin, stdout)
      h = trace display in hexadecimal
    To make self extracting archive: append to a copy of zpaqsfx.exe
    I still need to do more testing and port LZP from min.cfg because the only POST command that works now is 0. I already ported the x transform (E8E9) to an external preprocessor and replaced all the jumps with if-endif and do loops. The preprocessor is written in ZPAQL. I could have used another language but it was easier this way because the two programs differ by only 1 instruction. In exe.cfg (max.cfg with e8e9 transform) I have "pcomp zpaq rexepre.cfg" which runs the ZPAQL preprocessor exepre.cfg.

    zpaq105 cmax.cfg x maxcomp\mso97.dll
    278.539 MB memory required.
    maxcomp\mso97.dll 3782416 -> 1576030
    (45.4 sec)

    zpaq105 cexe.cfg x maxcomp\mso97.dll
    278.539 MB memory required.
    zpaq rexepre.cfg maxcomp\mso97.dll x.$zpaq.pre ... OK
    maxcomp\mso97.dll 3782416 -> 1530715
    (47.7 sec)

    The transform and verification (both in ZPAQL) adds 2 seconds. If the output doesn't match then it prints FAILED and refuses to compress the file.

  10. #10
    Member
    Join Date
    Sep 2009
    Location
    APITRC
    Posts
    27
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Cool

    As far what my little brain understands is the custom language / instructions (that you have defined and) that zpaq requires to compress / decompress and then zpaq embeds that scirpt within compressed data, and unzpaq reuses it to decompress? Are these the .cfg files that we will supply are being counted as integrated compression/decompression routines?

    Am confused at it.

    regards
    Last edited by Scientist; 29th September 2009 at 02:09.

  11. #11
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    zpaq 1.05 has been posted. http://mattmahoney.net/dc/#zpaq

    The config files describe a compression algorithm: a context mixing model (an arrangement of various bit predictors), a program (in ZPAQL) to compute their contexts, and optionally another program in ZPAQL to post-process the data after it has been decoded. If you use post-processing, then you also need a preprocessor that does the inverse, but the preprocessor is only needed during compression. When you compress, the config file will call the preprocessor to create a temporary file. Then it will test your postprocessor to check whether it can restore the original data. If not, ZPAQ will refuse to compress the file because it would not decompress correctly. If the test passes, then ZPAQ compresses the temporary file and appends a description of the model and postprocessing code to the archive. When decompressing, you don't need any other files.

    This is different than version 1.04 because it had only 2 preprocessors (e8e9 and lzp) that were built in and didn't have a compress time check (not that I found any files where they failed). In v1.05 I made these external. There is a program lzppre.exe (in C++) and exepre.cfg (in ZPAQL) that are called from min.cfg and exe.cfg respectively. exe.cfg is just max.exe with the e8e9 transform.

    In v1.04, min.cfg looks like this:

    Code:
    comp 3 3 18 20 1 (hh hm ph pm n)
      0 cm 19 5 (context model size=2^19, limit=5*4)
    hcomp
      *d<>a a^=*d a<<= 8 *d=a (order 3 context)
      halt
    post
      p 127 2 96 (LZP esc minlen hmul (order 4, min length 3))
    end
    The command "p 127 2 96" tells v1.04 to run the LZP preprocessor using 127 as the escape code, 2 as the minimum length-1, and 96 as the hash multiplier (which determines the context order). The preprocessor would insert the corresponding postprocessor code.

    v1.05 does not have any internal preprocessors. min.cfg looks like this:

    Code:
    (zpaq 1.05 minimum (fast) compression)
    
    comp 3 3 18 20 1 (hh hm PH PM n)
      0 cm 19 5 (context model size=2^19, limit=5*4)
    hcomp
      *d<>a a^=*d a<<= 8 *d=a (order 3 context)
      halt
    pcomp lzppre 18 20 127 2 96
      (lzppre PH PM ESC MINLEN HMUL)
      (If you change these values, then change them in the code too)
    
      (The sequence ESC 0 codes for ESC. The sequence ESC LEN
       codes for a match of length LEN+MINLEN at the last place
       in the output buffer M (size 2^PM) that had the same context
       hash in the low PH bits of D. D indexes hash table H
       which points into buffer M, which contains B bytes.
       When called, A contains the byte to be decoded and F=true
       if the last byte was ESC. The rolling context hash D is
       updated by D=D*HMUL+M[B])
    
      if (last byte was ESC then copy from match)
        a> 0 jf 37 (goto output esc)
        a+= 2 (MINLEN)
        r=a 0 (save length in R0)
        c=*d (c points to match)
        do (find match and output it)
          *d=b (update index with last output byte)
          a=*c *b=a b++ c++ out (copy and output matching byte)
          d<>a a*= 96 (HMUL)
          a+=d d<>a (update context hash)
          a=r 0 a-- r=a 0 (decrement length)
        a> 0 while (repeat until length is 0)
        halt
      endif
    
      (otherwise, set F for ESC)
      a== 127 (ESC) if
        halt
      endif
    
      (reset state at EOF)
      a> 255 if
        a>a halt (F=0)
    
        (goto here: output esc)
        a= 127 (ESC)
      endif
      *d=b (update index)
      *b=a b++ out (update buffer and output)
      d<>a a*= 96 (HMUL)
      a+=d d<>a (update context hash)
      halt
    end
    The COMP and HCOMP sections are the same. This describes the context model (a simple order 2 actually, not 3). The PCOMP section starts with the preprocessor command. v1.05 includes a program lzppre.exe (and lzppre.cpp source) that was taken from the old class PreProcessor (now gone) and modified into a standalone program. But instead of inserting the postprocessing code automatically into the archive, it is inserted from the config file.

    You might wonder how this is an improvement. The intention is to make it easier to write new transforms, for example, a dictionary transform for text or a 2-D delta transform for images. Before, you had to modify zpaq.cpp class PreProcessor to add transform code, then write and debug the inverse in ZPAQL and do lots of testing to make sure it never failed to recover the data. Now you don't need to touch zpaq.cpp. You write a separate preprocessor program in any language you want and the postprocessing code in ZPAQL in the config file. ZPAQ will run the input through both programs to make sure the input is recoverable. There are also tools in ZPAQ to debug ZPAQL programs. For example, you can run a config file as a regular program. I did this with the e8e9 transform because the encoder and decoder only differed in 1 instruction so it was easier that way than writing it in C++. exe.cfg (max.cfg with e8e9) does this:

    Code:
    pcomp zpaq rexepre.cfg
    exe.cfg contains the preprocessor code. The "r" command runs the program with an input file and output file as 2 arguments, that PCOMP conveniently appends to the command.

  12. #12
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    zpaq 1.06. http://mattmahoney.net/dc/

    Updated the spec (revision 1) to add a recommendation for embedding zpaq archives in other data (like self extractors). zpaq 1.06, reference decoder unzpaq 1.06, and self extractor zpaqsfx 1.06 can all now find and extract multiple archives mixed with arbitrary data. ZPAQ adds the "ta" command to append a locater tag so that these programs can find them. This is what the self extracting archives already did. But I just added it to the standard and generalized it. The tag is the last 13 bytes of zpaqsfx.tag (which I chose at random) to stay compatible with older versions. zpaqsfx.exe doesn't really need to have zpaqsfx.tag appended to it now, but I did anyway in case you use "a" instead of "ta" to append to a copy of it. It will work either way.

    I also updated the scope of the spec (it is no longer a "candidate") and corrected some typos.

    unzpaq now extracts to the current directory by default, even if the archive stores path names. This behavior is not required by the standard, but is nicer zpaqsfx still uses stored path names, and zpaq gives you the choice (x or px)

    Well, that's enough features for now. It's complicated enough. I really need to work on actual data compression instead.

Similar Threads

  1. zpaq updates
    By Matt Mahoney in forum Data Compression
    Replies: 2527
    Last Post: 4th May 2019, 13:33
  2. zpaq 1.02 update
    By Matt Mahoney in forum Data Compression
    Replies: 11
    Last Post: 10th July 2009, 01:55
  3. ZPAQ pre-release
    By Matt Mahoney in forum Data Compression
    Replies: 54
    Last Post: 23rd March 2009, 03:17
  4. PAQ9 preview
    By Matt Mahoney in forum Forum Archive
    Replies: 39
    Last Post: 9th January 2008, 06:05
  5. FreeArc 0.40 preview
    By Bulat Ziganshin in forum Forum Archive
    Replies: 16
    Last Post: 17th August 2007, 10:28

Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •