As a test, I created a text file containing a list of all the prime numbers up to 10^8 and tested it with various compressors. The test file is 5,761,455 lines (CR LF terminated) and 56,860,455 bytes, like this:

Code:
2
3
5
7
11
...
99999971
99999989

Results are below. It appears that fast adapting models like LZ77 and CM do better than stationary models like BWT, PPM, CTW, and DMC. Otherwise the ranking is probably what you would expect, with the better compressors on top. I tested each compressor with options for max (or near max) compression, and sometimes with other options as well.

Code:
        97 primes.zpaq (zpaqd cinst primes.cfg)
 3,412,333 primes-7.paq8pxd_v5 (cm)
 3,712,790 primes-cO.nz (nanozip optimum2)
 3,715,220 primes.nz (nanozip optimum1)
 4,102,032 primes-7.ccmx (cm)
 4,145,319 primes-method-4nc0.1013i1.1.1.1.1.1.1.1ac0.0.1009.255c0.0.1008.255m16st.zpaq (cm)
 4,219,602 primes-method-5.zpaq (cm)
 4,251,051 primes-method-6.zpaq (cm)
 4,440,874 primes-cc.nz (cm)
 4,576,869 primes-7.fp8 (cm)
 4,687,363 primes-6.xwrt (xwrt LZMA)
 5,121,672 primes-8.lpaq9m (cm)
 5,346,772 primes.fb (slim23d, ppm)
 5,452,513 primes-method-4.zpaq (cm)
 5,523,560 primes-mx.7z (7zip lzma)
 5,526,668 primes-m9.arc (freearc)
 5,537,634 primes-14.xwrt (xwrt lpaq6)
 5,783,409 primes-m6.zcm (cm)
 5,973,216 primes.pmm (ppmonstr -m256 -o12 -r1, ppm)
 6,109,492 primes-86.cmm4 (cm)
 6,146,346 primes-m1500.pmm (ppmonstr 1.5 GB)
 7,313,293 primes.sma (smac 1.12)
 8,297,539 primes-cd.nz (nz_lzhd)
 8,352,499 primes.sr2 (symbol ranking)
 8,777,721 primes-m-lzx-21.cab (cabarc -m lzx:21)
 8,804,406 primes.comprox (lz77)
 9,090,508 primes-3.xwrt (xwrt zlib)
 9,928,938 primes.pmd (ppmd -o4 -m10 -r0)
10,956,662 primes-d6-n16M-f16M.ctw (ctw)
11,479,732 primes-method-2.zpaq (lz77+cm)
11,817,976 primes-cf.nz (nanozip lzpf)
11,908,964 primes-m5.rar (lz77)
12,104,762 primes-5.tor (tornado lz77)
12,322,628 primes-b60-m3.bsc (bwt)
13,067,765 primes-11.tor (tornado lz77 max)
14,342,399 primes-9.gz (deflate lz77)
14,342,626 primes-9.zip (deflate lz77)
14,698,852 primes-b60-m4.bsc (sort transform order 4)
15,719,994 primes-1000.hook (dmc, 1 GB)
17,318,522 primes-9.xwrt (xwrt ppmvc)
17,379,492 primes.bz2 (bwt+huff)
18,141,307 primes.Z (compress)
18,620,319 primes-method-3.zpaq (bwt)
18,742,955 primes-method-1.zpaq (lz77+huff)
19,027,109 primes-b60-m6.bsc (sort order 6)
19,558,810 primes-b60.bsc (bwt)
19,811,043 primes.bcm (bwt 64 MB blocks)
21,569,610 primes-1000000000.dmc (dmc 1 GB)
23,481,131 primes-c2.lz4 (lz77)
24,140,714 primes-c1.lz4 (lz77)
26,746,134 primes-c0.lz4 (lz77)
56,860,455 primes (uncompressed, 5,761,455 lines)

One unusual result was that splitting the file into independent blocks improved compression, the opposite of what you would expect. Thus, zpaq -method 5 compresses better than 6, even though the models are the same except that 6 uses a single block and more memory. We also see some unusual results with ppmd, where order 3 compresses better than higher orders, and adding memory makes compression worse.

Code:
 9,927,489 primes-o3-m256-r1.pmd (ppmd order 3, 256 MB, cut off model)
 9,928,938 primes-o4-m256-r0.pmd (restart model from scratch)
 9,928,938 primes.pmd (-o4 -m10 -r0)
10,195,171 primes-o2-m256-r1.pmd
10,237,460 primes-o5-m256-r1.pmd
12,296,039 primes-o6-m256-r1.pmd
16,272,258 primes-o8-m10-r0.pmd
19,540,294 primes-o8-m256-r0.pmd
19,668,057 primes-o12-m256-r1.pmd
19,668,073 primes-o16-m256-r1.pmd
20,069,645 primes-o8-m256-r1.pmd

The top result with zpaq (other than the 97 byte archive) uses a custom model that I found experimentally:

-method 4nc0.1013i1.1.1.1.1.1.1.1ac0.0.1009.255c0.0.1008.255m16st

which has the following meaning:

4 - 2^4 = 16 MB blocks.

n - no E8E9 transform.

c0.1013 - ICM with order 0 context + distance to the last occurrence of a 13 (CR) byte.

i1.1.1.1.1.1.1.1 - ISSE chain of 8 components, each increasing the previous context by 1 order (order 1..8 including the ICM context).

a - match model with default parameters.

c0.0.1009.255 - ICM with a sparse context consisting of the current byte (order 0), skipping the next 9 bytes and taking the next byte (bit mask of 255 selecting all bits).

c0.0.1008.255 - sparse context, but skipping 8 bytes.

m16 - mixer with 16 bit (order 1) context.

s - SSE with default parameters (order 0 context).

t - MIX2 mixing the last 2 components with default parameters (order 0 context).
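To see how the method string breaks down into the pieces listed above, here is a small Python sketch. The function name split_method and the regular expressions are my own assumptions for illustration. This is not how zpaq itself parses the argument; it only relies on the pattern visible above: a leading block size digit, then one letter per component followed by optional dot-separated numbers.

```python
import re

def split_method(method: str):
    """Split a -method string into (block size exponent, list of component tokens).

    Hypothetical helper for illustration only, not zpaq's own parser.
    """
    m = re.match(r"(\d+)(.*)$", method)
    if m is None:
        raise ValueError("method string must start with a digit")
    level = int(m.group(1))
    # Each component is one lowercase letter plus optional
    # dot-separated numeric arguments, e.g. "c0.0.1009.255".
    tokens = re.findall(r"[a-z][0-9.]*", m.group(2))
    return level, tokens

METHOD = "4nc0.1013i1.1.1.1.1.1.1.1ac0.0.1009.255c0.0.1008.255m16st"
level, tokens = split_method(METHOD)
print(f"block size: 2^{level} = {2 ** level} MB")
for token in tokens:
    print(token)
```

The tokens come out in the same order as the list above, which makes it easier to check a long method string by eye before passing it to zpaq.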

Recall that an ICM is an indirect context model (context -> bit history -> prediction). An ISSE is an indirect secondary symbol estimator, which adjusts the previous prediction using a bit history to select a pair of weights for a 2 input mixer whose inputs are the prediction and a constant 1. A match model looks for a context match and predicts the next bit. SSE adjusts a prediction using an interpolated table taking the previous prediction and a context as input. A MIX averages predictions in the logistic domain, log(p(1)/p(0)), using weights selected by a context, then adjusts the weights to favor the better models. A MIX2 is a 2 input MIX with weights constrained to add to 1.
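To make the MIX description more concrete, here is a minimal floating point sketch in Python. It is not libzpaq's code. The names stretch, squash, and Mixer, the learning rate, and the two toy input models are all assumptions for illustration, and it leaves out the context that selects a weight set. It only shows the idea of averaging in the logistic domain and then nudging the weights toward the inputs that predicted well.

```python
import math

def stretch(p: float) -> float:
    """ln(p / (1 - p)): map a probability (strictly between 0 and 1) to the logistic domain."""
    return math.log(p / (1.0 - p))

def squash(x: float) -> float:
    """Inverse of stretch: 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

class Mixer:
    """Toy mixer with a single weight set (no selecting context)."""

    def __init__(self, n_inputs: int, rate: float = 0.02):
        self.weights = [1.0 / n_inputs] * n_inputs
        self.rate = rate              # assumed learning rate
        self.inputs = [0.0] * n_inputs
        self.p = 0.5

    def predict(self, probs):
        """Combine input probabilities that the next bit is 1."""
        self.inputs = [stretch(p) for p in probs]
        dot = sum(w * x for w, x in zip(self.weights, self.inputs))
        self.p = squash(dot)
        return self.p

    def update(self, bit: int):
        """Move each weight in proportion to the error and to its input."""
        err = bit - self.p
        self.weights = [w + self.rate * err * x
                        for w, x in zip(self.weights, self.inputs)]

mixer = Mixer(2)
for i in range(1000):
    bit = i % 2
    good = 0.9 if bit else 0.1   # toy model that leans the right way
    bad = 0.3 if bit else 0.7    # toy model that leans the wrong way
    mixer.predict([good, bad])
    mixer.update(bit)
print(mixer.weights)
```

After training, the weight on the first input should have grown and the weight on the second should have shrunk, which is the "favor the better models" behavior described above.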

But I guess I should explain the 97 byte archive. I wrote the program to create the list of primes as a ZPAQL config file, then used the same config to compress the list. The commands are:

zpaqd r primes.cfg p nul: primes (create file primes, about 9 seconds on a 2.0 GHz T3200)

zpaqd cinst primes.cfg primes.zpaq primes (compress to primes.zpaq, 0 seconds)

zpaq x primes.zpaq (optional, decompress, takes 9 seconds)

primes.cfg is below. The post-processor section ignores the decoded output up to EOF, then computes and prints the list of prime numbers using a sieve of Eratosthenes.

Code:
(primes.cfg - compress a list of primes up to 10^8)
(Public domain)

comp 0 0 0 27 0
hcomp (empty = no compression)
pcomp discard ; (copy nul: %2) (cp /dev/null $2)
  a> 255 ifnot halt endif (ignore until EOF)
  a= 100 a*=a a*=a d=a (d=max prime)

  (sieve of Eratosthenes: M[i]=1 if i is composite)
  b= 1
  do
    b++ a=b a<d if (b=2...d-1)
      a=*b a== 0 if (b is prime?)

        (mark multiples of b as composite)
        a=b a+=b c=a
        do
          a=c a<d if (c=2*b, 3*b,...)
            a= 1 *c=a
            a=c a+=b c=a
            forever
          endif

        (print b in base 10)
        c=d
        do (c=100000000, 10000000,..., 10, 1)
          a=b a<c ifnot
            a/=c a%= 10 a+= 48 out (print digit)
          endif
          a=c a/= 10 c=a
        a> 0 while
        a= 13 out a= 10 out (print newline)
      endif
      forever
    endif
  halt
end

The zpaqd "r" command says to run the config file primes.cfg. The "p" says to run the pcomp (post-processor) section, since the hcomp section is empty. It is run with input nul: and output file "primes".

The zpaqd command "cinst" says to create a new archive (c) with no comment (i), no filename (n), no checksum (s), and no header locator tag (t). Each of these saves a few bytes. This is followed by the config file, archive, and input files. They are saved in streaming mode in a single block. You can use zpaq to decompress. Since no filename is saved, it just drops the .zpaq extension unless you rename the output, e.g. "zpaq x primes.zpaq primes -to newname".

The pcomp section in primes.cfg specifies a pre-processor "discard", which is supposed to take 2 arguments (input file and output file) and perform the inverse of the post-processor's transform before compression. In this case, it means simply ignoring the input and creating an empty file. In Windows, you can create a file named discard.bat with the line "copy nul: %2". In Linux, create a script with "cp /dev/null $2" and chmod +x it. zpaqd will normally (without the s option) run the post-processor at compress time and verify that the output matches the input.

The ZPAQL language is described in libzpaq.h in the zpaq distribution. The program sets the M array to 2^27 bytes and uses a 1 to mark the multiples of a prime number as composite. It keeps the size of the list (10^8) in D, and uses B and C as the outer and inner indexes of the sieve algorithm (where *B and *C point into M).
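For readers who don't want to trace ZPAQL, here is a rough Python port of the same logic. The function name write_primes and the use of a bytearray sized to the limit (rather than a fixed 2^27 bytes) are my assumptions. The variables m, d, b, and c stand in for M, D, B, and C. Like the ZPAQL version, the digit loop assumes the limit is a power of 10.

```python
import io

def write_primes(limit: int, out) -> None:
    """Write every prime below `limit` to `out`, one per CR LF terminated line.

    `limit` must be a power of 10 because the digit loop starts at c = limit.
    """
    m = bytearray(limit)   # m[i] = 1 if i is composite
    d = limit              # size of the list
    b = 1
    while True:
        b += 1
        if not b < d:      # b = 2 ... d-1
            break
        if m[b] == 0:      # b is prime
            # mark multiples of b as composite: c = 2*b, 3*b, ...
            c = b + b
            while c < d:
                m[c] = 1
                c += b
            # print b in base 10, skipping leading zeros
            c = d
            while True:
                if not b < c:
                    out.write(bytes([b // c % 10 + 48]))
                c //= 10
                if not c > 0:
                    break
            out.write(b"\r\n")

buf = io.BytesIO()
write_primes(100, buf)
data = buf.getvalue()
print(data.count(b"\r\n"), "lines,", len(data), "bytes")  # 25 lines, 96 bytes
```

Called with 10**8 and a real output file, a faithful port should reproduce the 56,860,455 byte test file, although pure Python will be far slower than the 9 seconds quoted above.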