Results 1 to 15 of 15

Thread: Compressing prime numbers

  1. #1
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 795 Times in 488 Posts

    Compressing prime numbers

    As a test, I created a text file containing a list of all the prime numbers up to 10^8 and tested it with various compressors. The test file is 5,761,455 lines (CR LF terminated) and 56,860,455 bytes like this:

    Code:
    2
    3
    5
    7
    11
      ...
    99999971
    99999989
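A file in this exact format can be generated with a short sieve. Here is a minimal Python sketch (the output filename primes.txt is my own choice, not from the test):

```python
def primes_below(n):
    """Sieve of Eratosthenes: return a list of all primes < n."""
    composite = bytearray(n)            # composite[i] != 0 means i is composite
    result = []
    for i in range(2, n):
        if not composite[i]:
            result.append(i)
            for j in range(i * i, n, i):
                composite[j] = 1
    return result

# Use 10**8 to reproduce the 56,860,455-byte test file (needs ~100 MB of RAM).
with open("primes.txt", "wb") as f:
    for p in primes_below(100):
        f.write(b"%d\r\n" % p)          # CR LF terminated, as in the test file
```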
    Results are below. It appears that fast adapting models like LZ77 and CM do better than stationary models like BWT, PPM, CTW, and DMC. Otherwise it is probably what you expect with the better compressors on top. I tested each compressor with options for max (or near max) compression, and sometimes with other options as well.

    Code:
            97 primes.zpaq           (zpaqd cinst primes.cfg)
     3,412,333 primes-7.paq8pxd_v5   (cm)
     3,712,790 primes-cO.nz          (nanozip optimum2)
     3,715,220 primes.nz             (nanozip optimum1)
     4,102,032 primes-7.ccmx         (cm)
     4,145,319 primes-method-4nc0.1013i1.1.1.1.1.1.1.1ac0.0.1009.255c0.0.1008.255m16st.zpaq (cm)
     4,219,602 primes-method-5.zpaq  (cm)
     4,251,051 primes-method-6.zpaq  (cm)
     4,440,874 primes-cc.nz          (cm)
     4,576,869 primes-7.fp8          (cm)
     4,687,363 primes-6.xwrt         (xwrt LZMA)
     5,121,672 primes-8.lpaq9m       (cm)
     5,346,772 primes.fb             (slim23d, ppm)
     5,452,513 primes-method-4.zpaq  (cm)
     5,523,560 primes-mx.7z          (7zip lzma)
     5,526,668 primes-m9.arc         (freearc)
     5,537,634 primes-14.xwrt        (xwrt lpaq6)
     5,783,409 primes-m6.zcm         (cm)
     5,973,216 primes.pmm            (ppmonstr -m256 -o12 -r1, ppm)
     6,109,492 primes-86.cmm4        (cm)
     6,146,346 primes-m1500.pmm      (ppmonstr 1.5 GB)
     7,313,293 primes.sma            (smac 1.12)
     8,297,539 primes-cd.nz          (nz_lzhd)
     8,352,499 primes.sr2            (symbol ranking)
     8,777,721 primes-m-lzx-21.cab   (cabarc -m lzx:21)
     8,804,406 primes.comprox        (lz77)
     9,090,508 primes-3.xwrt         (xwrt zlib)
     9,928,938 primes.pmd            (ppmd -o4 -m10 -r0)
    10,956,662 primes-d6-n16M-f16M.ctw  (ctw)
    11,479,732 primes-method-2.zpaq  (lz77+cm)
    11,817,976 primes-cf.nz          (nanozip lzpf)
    11,908,964 primes-m5.rar         (lz77)
    12,104,762 primes-5.tor          (tornado lz77)
    12,322,628 primes-b60-m3.bsc     (bwt)
    13,067,765 primes-11.tor         (tornado lz77 max)
    14,342,399 primes-9.gz           (deflate lz77)
    14,342,626 primes-9.zip          (deflate lz77)
    14,698,852 primes-b60-m4.bsc     (sort transform order 4)
    15,719,994 primes-1000.hook      (dmc, 1 GB)
    17,318,522 primes-9.xwrt         (xwrt ppmvc)
    17,379,492 primes.bz2            (bwt+huff)
    18,141,307 primes.Z              (compress)
    18,620,319 primes-method-3.zpaq  (bwt)
    18,742,955 primes-method-1.zpaq  (lz77+huff)
    19,027,109 primes-b60-m6.bsc     (sort order 6)
    19,558,810 primes-b60.bsc        (bwt)
    19,811,043 primes.bcm            (bwt 64 MB blocks)
    21,569,610 primes-1000000000.dmc (dmc 1 GB)
    23,481,131 primes-c2.lz4         (lz77)
    24,140,714 primes-c1.lz4         (lz77)
    26,746,134 primes-c0.lz4         (lz77)
    56,860,455 primes                (uncompressed, 5,761,455 lines)
    One unusual result was that splitting the file into independent blocks improved compression, the opposite of what you would expect. Thus, zpaq -method 5 compresses better than 6, even though the models are the same except that 6 uses a single block and more memory. We also see some unusual results with ppmd, where order 3 compresses better than higher orders, and adding memory makes compression worse.

    Code:
     9,927,489 primes-o3-m256-r1.pmd (ppmd order 3, 256 MB, cut off model)
     9,928,938 primes-o4-m256-r0.pmd (restart model from scratch)
     9,928,938 primes.pmd            (-o4 -m10 -r0)
    10,195,171 primes-o2-m256-r1.pmd
    10,237,460 primes-o5-m256-r1.pmd
    12,296,039 primes-o6-m256-r1.pmd
    16,272,258 primes-o8-m10-r0.pmd
    19,540,294 primes-o8-m256-r0.pmd
    19,668,057 primes-o12-m256-r1.pmd
    19,668,073 primes-o16-m256-r1.pmd
    20,069,645 primes-o8-m256-r1.pmd
    The top result with zpaq (other than the 97 byte archive) uses a custom model that I found experimentally:

    -method 4nc0.1013i1.1.1.1.1.1.1.1ac0.0.1009.255c0.0.1008.255m16st

    which has the following meaning:

    4 - 2^4 = 16 MB blocks.
    n - no E8E9 transform.
    c0.1013 - ICM with order 0 context + distance to the last occurrence of a 13 (CR) byte.
    i1.1.1.1.1.1.1.1 - ISSE chain of 8 components, each increasing the previous context by 1 order (order 1..8 including the ICM context).
    a - match model with default parameters.
    c0.0.1009.255 - ICM with a sparse context consisting of the current byte (order 0), skipping the next 9 bytes and taking the next byte (bit mask of 255 selecting all bits).
    c0.0.1008.255 - sparse context, but skipping 8 bytes.
    m16 - mixer with 16 bit (order 1) context.
    s - SSE with default parameters (order 0 context).
    t - MIX2 mixing the last 2 components with default parameters (order 0 context).

    Recall that an ICM is an indirect context model (context -> bit history -> prediction). An ISSE is an indirect secondary symbol estimator, which adjusts the previous prediction by using a bit history to select a pair of weights for a 2 input mixer whose inputs are the prediction and a constant 1. A match model looks for a context match and predicts the next bit. An SSE adjusts a prediction using an interpolated table that takes the previous prediction and a context as input. A MIX averages predictions in the logistic domain (log(p(1)/p(0))) using weights selected by a context, then adjusts the weights to favor the better models. A MIX2 is a 2 input MIX with weights constrained to add to 1.
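As a toy illustration of the logistic-domain mixing described above (the update rule and learning rate here are simplified stand-ins, not zpaq's fixed-point code):

```python
import math

def stretch(p):
    """Probability (0,1) -> logistic domain: log(p/(1-p))."""
    return math.log(p / (1.0 - p))

def squash(x):
    """Logistic domain -> probability: inverse of stretch."""
    return 1.0 / (1.0 + math.exp(-x))

def mix(predictions, weights):
    """MIX: weighted average of the predictions in the logistic domain."""
    return squash(sum(w * stretch(p) for w, p in zip(weights, predictions)))

def train(predictions, weights, bit, rate=0.01):
    """After the actual bit arrives, adjust weights to favor the models
    that predicted it (a gradient step on coding cost)."""
    err = bit - mix(predictions, weights)
    return [w + rate * err * stretch(p) for w, p in zip(weights, predictions)]
```

A MIX2 would additionally constrain the two weights to sum to 1 after each step.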

    But I guess I should explain the 97 byte archive. I wrote the program to create the list of primes as a ZPAQL config file, then used the same config to compress the list. The commands are:

    zpaqd r primes.cfg p nul: primes (create file primes, about 9 seconds on a 2.0 GHz T3200)
    zpaqd cinst primes.cfg primes.zpaq primes (compress to primes.zpaq, 0 seconds)
    zpaq x primes.zpaq (optional, decompress, takes 9 seconds)

    primes.cfg is below. The post-processor section ignores the decoded output up to EOF, then computes and prints the list of prime numbers using a sieve of Eratosthenes.

    Code:
    (primes.cfg - compress a list of primes up to 10^8)
    (Public domain)
    comp 0 0 0 27 0
    hcomp (empty = no compression)
    pcomp discard ; (copy nul: %2) (cp /dev/null $2)
      a> 255 ifnot halt endif (ignore until EOF)
      a= 100 a*=a a*=a d=a (d=max prime)
    
      (sieve of Eratosthenes: M[i]=1 if i is composite)
      b= 1 do b++ a=b a<d if (b=2...d-1)
        a=*b a== 0 if (b is prime?)
    
          (mark multiples of b as composite)
          a=b a+=b c=a do a=c a<d if (c=2*b, 3*b,...)
            a= 1 *c=a
            a=c a+=b c=a
          forever endif
    
          (print b in base 10)
          c=d do  (c=100000000, 10000000,..., 10, 1)
            a=b a<c ifnot
              a/=c a%= 10 a+= 48 out (print digit)
            endif
          a=c a/= 10 c=a a> 0 while
          a= 13 out a= 10 out (print newline)
        endif
      forever endif    
      halt
    end
    The zpaqd "r" command says to run the config file primes.cfg. The "p" says to run the pcomp (post-processor) section, since the hcomp section is empty. It is run with input nul: and output file "primes".

    The zpaqd command "cinst" says to create a new archive (c) with no comment (i), no filename (n), no checksum (s), and no header locator tag (t). Each of these saves a few bytes. This is followed by the config file, archive, and input files. They are saved in streaming mode in a single block. You can use zpaq to decompress. Since no filename is saved, it just drops the .zpaq extension unless you rename the output, e.g. "zpaq x primes.zpaq primes -to newname".

    The pcomp section in primes.cfg specifies a pre-processor "discard", which is supposed to take 2 arguments (input file and output file) and perform the inverse transform before compression. In this case, it means simply ignoring the input and creating an empty file. In Windows, you can create a file named discard.bat with the line "copy nul: %2". In Linux, create a script with "cp /dev/null $2" and chmod +x it. zpaqd will normally (without the s option) run the post-processor at compress time and verify that the output matches the input.

    The ZPAQL language is described in libzpaq.h in the zpaq distribution. The program sets the M array to 2^27 bytes and uses a 1 to mark the multiples of a prime number as composite. It keeps the size of the list (10^8) in D, and uses B and C as the outer and inner indexes of the sieve algorithm (where *B and *C point into M).
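For readers who don't read ZPAQL, the pcomp section corresponds roughly to this Python transliteration (variable names are mine; the config hard-codes d = 10**8):

```python
import io

def run_sieve(d, out):
    """Print all primes < d, one per line, CR LF terminated, following
    the structure of the pcomp section above."""
    m = bytearray(d)                      # M: m[i] != 0 means i is composite
    for b in range(2, d):                 # outer index B
        if m[b] == 0:                     # b is prime
            for c in range(2 * b, d, b):  # inner index C: 2b, 3b, ...
                m[c] = 1
            c = d                         # print b in base 10
            while c > 0:
                if b >= c:                # suppresses leading zeros
                    out.write(bytes([48 + (b // c) % 10]))
                c //= 10
            out.write(b"\r\n")            # a= 13 out a= 10 out

buf = io.BytesIO()
run_sieve(100, buf)                       # tiny d for illustration
```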

  2. #2
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    But I guess I should explain the 97 byte archive. I wrote the program to create the list of primes as a ZPAQL config file, then used the same config to compress the list.
    I was hoping you were going to say you had discovered an oracle for making Kolmogorov compressors. Was this test meant to simulate real data? How well did you expect it to compress? I'm not surprised that LZ didn't do very well. It would be interesting to compare plain Huffman, and/or an arithmetic coder with a short context. It seems like a plain entropy coder might do better than something like LZ on this type of task.

  3. #3
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    885
    Thanks
    480
    Thanked 278 Times in 118 Posts
    Could this situation (prime compression) be a distant cousin of dictionary compression (where all words are listed only once), with equivalent consequences?

  4. #4
    Expert Matt Mahoney
    I wouldn't call it a Kolmogorov oracle. I could probably work on it and get it below 97 bytes, but I would never know if I had the best solution.

    Actually, LZ does better than BWT, but we kind of expected that because it's also true of sorted data like english.dic.

    I did some experiments with simple order 0 models using a CM (context -> prediction) with various adaptation rates and with an ICM (context -> bit history -> prediction). For a CM, higher count limits mean slower adaptation. (The prediction is adjusted by error/count.) For a stationary source, a CM with a high count limit will compress best. For this data, an ICM compresses best. For sorted data, you often see a bit history consisting of a sequence of 0 bits followed by a sequence of 1 bits, after which the context is never seen again. You need an ICM to model this right.

    Code:
    19,282,696 primes-method-6nc.zpaq                       (ICM order 0, 64 MB)
    19,318,668 primes-method-4nc.zpaq                       (ICM order 0, 16 MB)
    23,209,952 primes-method-4nc16.zpaq                     (CM, order 0, max count 16*4-4=60)
    23,319,559 primes-method-4nc8.zpaq                      (CM, max 56)
    23,359,339 primes-method-4nc64.zpaq                     (CM, max 252)
    23,670,797 primes-method-4nc256.zpaq                    (CM, max 1020)
    23,907,266 primes-method-4nc4.zpaq                      (CM, max 12)
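The error/count update described above can be sketched as a toy order-0 bit model (the class shape and count handling are illustrative; zpaq works in fixed point):

```python
class CM:
    """Toy order-0 direct context model. The prediction moves toward each
    observed bit by error/count, and the count is capped, so a higher cap
    means slower adaptation (better for stationary sources, worse here)."""
    def __init__(self, limit):
        self.limit = limit
        self.p = 0.5
        self.count = 1

    def update(self, bit):
        if self.count < self.limit:
            self.count += 1
        self.p += (bit - self.p) / self.count
```

Feeding the same bit repeatedly shows the effect: a model with a cap of 12 reaches a confident prediction much sooner than one with a cap of 1020.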
    Here is the complete set of experiments I did. I find it interesting that BWT + ICM gives worse compression than without the BWT. The best results use ICM-ISSE chains that include the column position (distance to the last carriage return) as context.

    Code:
            97 primes.zpaq                                  (custom model primes.cfg)
     4,145,319 primes-method-4nc0.1013i1.1.1.1.1.1.1.1ac0.0.1009.255c0.0.1008.255m16st.zpaq
     4,219,602 primes-method-5.zpaq                         (cm, 32 MB, with text and column models)
     4,251,051 primes-method-6.zpaq                         (cm, 64 MB)
     5,452,513 primes-method-4.zpaq                         (cm, 16 MB: ICM-ISSE chain 0..5, match, mix)
     6,139,469 primes-method-4nc0.1013.255.255.255.zpaq     (1 context: column+order 3, 16 MB blocks)
     6,161,128 primes-method-4nc0.1013.255.255.zpaq         (ICM column+order 2)
     6,162,945 primes-method-4nc0.1013.255.zpaq             (column+order 1)
     6,168,011 primes-method-4nc0.1013.zpaq                 (column+order 0)
     6,169,547 primes-method-4nc0.1010.zpaq                 (column from LF+order 0)
     6,254,554 primes-method-4nc0.1013.255.255.255.255.zpaq (column+order 4)
     6,769,896 primes-method-4nci1.1.1.1.1.zpaq             (ICM-ISSE chain order 0..5)
     7,001,705 primes-method-6nci1.1.1.1.1.zpaq             (64 MB blocks)
     7,283,668 primes-method-4nci1.1.1.1.zpaq               (ICM-ISSE chain order 0..4)
     7,321,673 primes-method-6nci1.1.1.1.zpaq               (64 MB)
     7,588,258 primes-method-4nci1.1.1.zpaq                 (ICM-ISSE 0..3)
     7,601,402 primes-method-6nci1.1.1.zpaq                 (64 MB)
     7,604,977 primes-method-4nc0.10.zpaq                   (offset%10 + order 0)
     8,095,705 primes-method-4nc0.0.255.255.255.zpaq        (order 3)
     8,107,191 primes-method-4nc0.0.255.255.255.255.zpaq    (order 4)
     8,162,122 primes-method-4nci1.1.zpaq                   (ICM-ISSE 0..2)
     8,180,861 primes-method-6nci1.1.zpaq                   (64 MB)
     8,617,838 primes-method-4nc0.0.255.255.zpaq            (order 2)
    10,722,713 primes-method-4nci1.zpaq                     (ICM-ISSE 0..1)
    10,762,384 primes-method-6nci1.zpaq                     (64 MB)
    11,479,732 primes-method-2.zpaq                         (lz77+cm)
    11,629,928 primes-method-4nc0.0.255.zpaq                (order 1)
    18,203,543 primes-method-3nci1.2.zpaq                   (bwt+order 0,1,3)
    18,308,843 primes-method-3nci1.1.zpaq                   (bwt+order 0..2)
    18,620,319 primes-method-3.zpaq                         (bwt+order 0..1)
    18,742,955 primes-method-1.zpaq                         (lz77+huff)
    19,282,696 primes-method-6nc.zpaq                       (ICM order 0, 64 MB)
    19,318,668 primes-method-4nc.zpaq                       (ICM order 0, 16 MB)
    19,323,164 primes-method-3nc.zpaq                       (bwt+ICM order 0, 16 MB)
    23,209,952 primes-method-4nc16.zpaq                     (CM, order 0, max count 16*4-4=60)
    23,319,559 primes-method-4nc8.zpaq                      (CM, max 56)
    23,359,339 primes-method-4nc64.zpaq                     (CM, max 252)
    23,670,797 primes-method-4nc256.zpaq                    (CM, max 1020)
    23,907,266 primes-method-4nc4.zpaq                      (CM, max 12)

  5. #5
    Expert Matt Mahoney
    This should compress to 85 bytes

    Code:
    (primes2.cfg - compress a list of primes up to 10^8
    Public domain.
    To create:   zpaqd r primes2.cfg p nul: primes (56 MB)
    To compress: zpaqd cinst primes2.cfg primes.zpaq primes (85 bytes)
    discard.bat should contain "copy nul: %2")
    
    comp 0 0 0 27 1
      0 icm 5
    hcomp
      halt
    pcomp discard ;
      a> 255 if (ignore until EOF)
        a= 10 a*=a a*=a a*=a d=a (d=max prime)
    
        (sieve of Eratosthenes: M[i]>0 if i is composite)
        b++ do b++ (b=2...d-1)
          a=*b a> 0 ifnot (b is prime?)
    
            (mark multiples of b as composite)
            c=b
            do a=c a+=b c=a  (c=2*b, 3*b,...)
              *c++
            a<d while
    
            (print b in base 10)
            c=d do  (c=100000000, 10000000,..., 10, 1)
              a=b a<c ifnot
                a/=c a%= 10 a+= 48 out (print digit)
              endif
            a=c a/= 10 c=a a> 0 while
            a= 13 out a= 10 out (print newline)
          endif
        a=b a<d while
      endif
      halt
    end
    Changes:
    - Replaced DO IF ... FOREVER ENDIF loops (equivalent to a while loop) with DO ... WHILE, saving a JMP instruction because the test is at the end.
    - Instead of setting M[i]=1 if composite, I increment it, which uses fewer instructions. This works as long as no number has more than 255 distinct prime factors.
    - Compressed the code with an order 0 ICM. The model adds 3 bytes but the compression saves 5 bytes for a net gain of 2.
    - Replaced B= 1 with B++ knowing that B is 0. Saves 1 byte.
    - Experimented with other variations. For example, A> 0 IFNOT compresses better than A== 0 IF, even though it is the same number of opcodes, probably because the same opcodes appear elsewhere in the program to improve compression (e.g. A> 0 WHILE). Likewise, A= 10 A*=A A*=A A*=A compresses the same as A= 100 A*=A A*=A even though it is one byte longer when uncompressed, because the operand 10 occurs in several other places.
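A quick sanity check of the 255-factor condition: the smallest number with k distinct prime factors is the product of the first k primes, so a few lines of Python confirm that no n < 10^8 can have more than 8 distinct prime factors, far below the limit:

```python
# Product of the first k primes is the smallest n with k distinct
# prime factors, so k is maximal while that product stays below 10**8.
first_primes = [2, 3, 5, 7, 11, 13, 17, 19, 23]
product, k = 1, 0
for p in first_primes:
    if product * p >= 10**8:
        break
    product *= p
    k += 1
# Here k == 8 (2*3*5*7*11*13*17*19 = 9699690; including 23 would
# exceed 10**8), so the byte counters can never overflow 255.
```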
    Last edited by Matt Mahoney; 17th May 2013 at 17:09. Reason: Comment M[i]=1 should be M[i]>0, primes.cfg should be primes2.cfg

  6. #6
    Member nburns
    Quote Originally Posted by Matt Mahoney
    I wouldn't call it a Kolmogorov oracle. I could probably work on it and get it below 97 bytes, but never know if I had the best solution.

    Actually, LZ does better than BWT, but we kind of expected that because it's also true of sorted data like english.dic
    Right... I forgot about all the common prefixes the primes would have. Number theory was never my favorite subject.

    For LZ, the best match would typically be the previous prime, so the distances would be very short. The BWT, in general, doesn't benefit from locality.

    It makes sense that splitting the file into blocks helped compression, because the very best matches are already grouped together. The more remote the context, the more it would just confuse the model.

    The BWT would tend to be self-defeating on this kind of data. It would indiscriminately sort together the least-significant and most-significant parts of the numbers, and the insignificant digits would get in between the useful contexts, spreading them apart and adding noise. The results would most likely be better if the digits were grouped by significance and sorted separately.

  7. #7
    Member nburns
    Quote Originally Posted by Matt Mahoney
    I wouldn't call it a Kolmogorov oracle. I could probably work on it and get it below 97 bytes, but never know if I had the best solution.
    If you got it down small enough, it could become amenable to an exhaustive search. Kolmogorov@home?

  8. #8
    Expert Matt Mahoney
    Except for the halting problem. The program would have to be really, really small for an exhaustive search. http://oeis.org/A004147

  9. #9
    Member nburns
    Quote Originally Posted by Matt Mahoney
    Except for the halting problem. The program would have to be really, really small for an exhaustive search. http://oeis.org/A004147
    Of course, you are right.

  10. #10
    Member nburns
    How long should it take to compile zpaq? I've been trying to build it on a Macintosh, and I noticed it was taking a long time. So I reworked the makefile to build each .cpp file one at a time, and it's been working on libzpaq.cpp for the last 4.5 hours. Is it going to stop, or does this mean something went wrong? The g++ on here is not the real g++, it's a frontend to LLVM. So maybe that has something to do with it.

  11. #11
    Expert Matt Mahoney
    On my Linux machine (Core i7 M620, 2.66 GHz, 4 GB, 4 hyperthreads, Ubuntu 64 bit, g++ 4.6.3), zpaq compiles in 11 seconds using Makefile.linux. zpaq and zpaqd together take 18 seconds (compiling libzpaq.cpp twice).

  12. #12
    Member nburns
    I eventually killed the g++ process. I've never seen anything like that before.

  13. #13
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by nburns
    I eventually killed the g++ process. I've never seen anything like that before.
    Contact the developers of this 'g++' which, I guess, is Clang. Though here Clang 3.1 has no problems with zpaq 6.16.

  14. #14
    Member nburns
    I don't think it's the same as clang, and I think Apple was largely behind the development of it. I don't own or use a mac outside of work, so the earliest I could try again is Monday. That is, if actual work doesn't get in the way. I got the impression that -fopenmp was dependent on gcc, and that that switch was necessary, but I was probably wrong about that. My interest in helping Apple with debugging their stuff isn't very high.

  15. #15
    Expert Matt Mahoney
    -fopenmp is for libdivsufsort.c. It implies -pthread which is used by zpaq.cpp. libzpaq doesn't use threading by itself. I guess that you only need the options for linking. In Windows, g++ -fopenmp will make divsufsort crash so I don't use it.
