I did some compression tests on synthetic data: a list of the prime numbers up to 1 million in 10 different formats. In theory all 10 files should compress to the same size because they all contain the same information. (That size would be very small if a compressor recognized the data as a list of primes, but none of them did.) The 10 files are:

msb - 32 bit integer format, most significant byte first.

lsb - 32 bits, LSB first (x86 format).

dbl - each number written twice (2,2,3,3,5,5,7,7,...,999983,999983) as 32 bit ints, MSB first.

hex - each number written as 8 hex digits with no separators. "0000000200000003"...

text - each number written in ASCII decimal followed by a newline "2\n3\n...999983\n".

twice - two copies of msb concatenated together (2,3,5,...999983,2,3,5,...999983), 32 bit ints MSB first.

interlv - all the first (most significant) bytes of msb, then all the second bytes, then all the third, then all the last (0,0,0..., 0,0,0..., 0,0,0..., 2,3,5...).

delta - successive differences as 32 bit ints, MSB first (2,1,2,2,4,2,4...).

bytemap - 1M chars, either '1' if prime or '0' if composite.

bitmap - same but as single bits packed into 125K bytes, LSB first.
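To make the "same information" claim concrete, here is a minimal sketch (not part of the original test) that rebuilds the list of primes from the delta layout and from the bitmap layout. The function names `read_be32`, `decode_delta` and `decode_bitmap` are made up for illustration, and the byte and bit order are assumed to match the descriptions above.

```c
#include <stddef.h>
#include <stdint.h>

/* Read one 32 bit integer stored MSB first. */
static uint32_t read_be32(const unsigned char *p) {
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* "delta" layout: each value is the difference from the previous prime
   (the first difference is taken from 0), so a running sum gives the primes.
   out must have room for nbytes/4 values. Returns the number of values. */
size_t decode_delta(const unsigned char *buf, size_t nbytes, uint32_t *out) {
    size_t n = nbytes / 4;
    uint32_t sum = 0;
    for (size_t k = 0; k < n; ++k) {
        sum += read_be32(buf + 4 * k);
        out[k] = sum;
    }
    return n;
}

/* "bitmap" layout: bit (i & 7) of byte i/8 is 1 when i is prime.
   Writes at most cap values to out. Returns the number of values written. */
size_t decode_bitmap(const unsigned char *buf, size_t nbytes,
                     uint32_t *out, size_t cap) {
    size_t n = 0;
    for (size_t i = 0; i < nbytes * 8 && n < cap; ++i)
        if ((buf[i / 8] >> (i & 7)) & 1)
            out[n++] = (uint32_t)i;
    return n;
}
```

Both decoders produce the same array of integers as reading msb directly, which is why an ideal compressor would not care which layout it is given.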

I grouped the results into LZ77, BWT, PPM, and CM type algorithms.

The zpaq methods s4.3c and s4.3ci1 are BWT with 16 MB blocks, modeled with an order 0 ICM and with an order 0-1 ICM-ISSE chain respectively. zpaq normally uses the second. s4.0c256 is a stationary order 0 model. s4.0c0 is an ICM (fast adapting order 0: context -> bit history -> prediction). s4.0ci1, s4.0ci1.1, s4.0ci1.1.1 and s4.0ci1.1.1.1 are order 1, 2, 3 and 4 ICM-ISSE chains. A trailing m adds a mixer. A trailing am adds a match model and a mixer.

Code:
     msb     lsb     dbl     hex    text   twice interlv   delta bytemap  bitmap
  313992  313992  627984  627984  538468  627984  313992  313992 1000000  125000  uncompressed
  314011  314011  160995  326410  299215  316485   90730  142644  232564   71928  lz5
  315511  315546  212296  274813  339411  315571   89784   89296  138739   71007  zpaq -m 1
  315402  315446  212283  273113  234554  315454   77504   62441   64726   52886  zpaq -m 2
  124751  152820  176242  202786  199829  144201   44959   35988   36004   36450  zpaq -m 3
  198314  105536  109362  198771  189936  396668   65146   57278   67579   46203  gzip
  198538  105760  109586  199139  190171  396894   65223   52480   55925   44541  zip -9
   57724   58701   57727   67378   77168   57848   53934   40044   52875   41368  7zip
   57724   58701   57727   67378   77354   57848   53212   40276   46369   41434  7zip -mx
   46428   46473   66777  139253  152351   50672   59957   47656   65942   44310  rar
   46194   46232   65248  143938  156028   50162   44978   46956   61330   44102  rar -m5 -ma5
   49328   43125   59371   67545   64953   49360   44604   42681   74288   43988  freearc
   45040   40676   45431   67545   62007   45498   42062   40370   40641   41694  freearc -m9
  190356  187421  271302  225982  214919  258811   50965   37831   38183   41196  bzip2
  183104  181153  183756  205063  199727  183167   44002   35328   34698   36193  bsc
  184914  184445  272700  221881  207723  272718   44994   35903   34983   35764  zpaq -m s4.3c
  123855  125097  175124  201847  198919  175014   44096   35225   34788   35674  zpaq -m s4.3ci1
   42902   43098   43503   62766   67771   41853   42186   36246   35783   37385  nz
  180336  191716  232018  144353  213793  180773   45507   36114   50945   36780  ppmd
   50613   49050   50757   66905   72933   50634   41696   34089   41987   35935  ppmonstr
  207941  207942  403928  241013  217784  416265  154946   70384   49824   38936  zpaq -m s4.0c256
  153306  154014  272911  229866  178614  305787   77212   70693   48121   39292  zpaq -m s4.0c0
   85141  110722  110278  150666  118636  171019   44696   64196   47450   38614  zpaq -m s4.0ci1
   72609  106695   79154  120408  100269  103116   42723   57541   46586   37520  zpaq -m s4.0ci1.1
   70770   91990   74563  100128   96293   91242   42841   40964   46291   36612  zpaq -m s4.0ci1.1.1
   70792   91861   74109   92700   90622   73422   42995   36769   45809   36019  zpaq -m s4.0ci1.1.1.1
   65157   84709   65644   89223   77471   68307   42291   36843   46252   35979  zpaq -m s4.0ci1.1.1.1m
   64888   83952   65536   89095   77486   65159   42420   37064   46893   36072  zpaq -m s4.0ci1.1.1.1am
   66131   84847   64891  202786  199829   66761   43239   35988   36004   36450  zpaq -m 4
   46419   46543   49942   56787   55973   47096   42641   35519   34450   30518  zpaq -m 5
   40741   40026   43083   46822   41930   40965   40249   34466   70477   32860  paq8pxd13 -8
   37835   37973   38597   56017   55520   37894   36255   34698   42764   35425  nz -cc

Interesting things to note.

lz5, zpaq -method 1 and 2 do not compress msb or lsb at all. There are no matches of length 4 or more.
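One way to check the "no matches of length 4 or more" explanation on a file is to count how many 4-byte windows repeat anywhere in it. The sketch below is an illustration only: `count_repeated_windows4` is a made-up name, it looks at every byte offset (not just aligned ones), and it ignores any minimum match length or distance limit a particular compressor may have. I have not run it on the real msb or lsb files, so no result is claimed here.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator for uint32_t values. */
static int cmp_u32(const void *a, const void *b) {
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Pack every 4-byte window of buf into a 32 bit value, sort the values,
   and count how many windows equal another window (total windows minus
   distinct windows). Returns 0 when no 4-byte sequence occurs twice,
   or -1 if memory allocation fails. */
long count_repeated_windows4(const unsigned char *buf, size_t n) {
    if (n < 4) return 0;
    size_t m = n - 3;  /* number of 4-byte windows */
    uint32_t *w = (uint32_t *)malloc(m * sizeof *w);
    if (!w) return -1;
    for (size_t k = 0; k < m; ++k)
        w[k] = ((uint32_t)buf[k] << 24) | ((uint32_t)buf[k + 1] << 16) |
               ((uint32_t)buf[k + 2] << 8) | (uint32_t)buf[k + 3];
    qsort(w, m, sizeof *w, cmp_u32);
    long dup = 0;
    for (size_t k = 1; k < m; ++k)
        if (w[k] == w[k - 1]) ++dup;
    free(w);
    return dup;
}
```

A return value of 0 on a file means an LZ77 coder that needs at least 4 matching bytes has nothing to point back to, so it can only emit literals.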

Several programs depend on byte order (msb vs lsb), especially zip, gzip, zpaq -m 3 (LZ77 with order 1 literal modeling), freearc, bzip2, bsc, ppmd, ppmonstr, and the zpaq ICM-ISSE chains.

paq8pxd_v13 -8 does really poorly on bytemap, worse than gzip.

zpaq -m 5 bitmap is the best result, probably because the sparse models are useful.

zip and gzip fail to recognize the second copy in 'twice' because they only have a 32K window. Low order context models and most BWT also fail to compress this well. BSC probably does well because of its LZP preprocessor. Not sure why zpaq -m 3 misses it.
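The window argument is simple arithmetic. In twice, each value's earlier copy lies one whole msb file back (313992 bytes, from the table), while in dbl it lies only 4 bytes back. The sketch below just compares those distances with the 32768 byte deflate window. The names `match_reachable`, `copy_distance_twice` and `copy_distance_dbl` are made up for illustration.

```c
#include <stddef.h>

enum { DEFLATE_WINDOW = 32768 };  /* largest match distance in deflate (zip, gzip) */

/* Returns 1 if an LZ77 coder with a window of `window` bytes can copy
   from data that lies `distance` bytes behind the current position. */
int match_reachable(size_t distance, size_t window) {
    return distance > 0 && distance <= window;
}

/* In "twice" the second copy starts right after the first,
   so every value repeats exactly one msb file length back. */
size_t copy_distance_twice(size_t msb_bytes) {
    return msb_bytes;
}

/* In "dbl" each 32 bit int is immediately followed by its copy. */
size_t copy_distance_dbl(void) {
    return 4;
}
```

With msb at 313992 bytes, `match_reachable(copy_distance_twice(313992), DEFLATE_WINDOW)` is 0 and `match_reachable(copy_distance_dbl(), DEFLATE_WINDOW)` is 1. That fits the table: gzip's twice result (396668) is twice its msb result (198314), while its dbl result (109362) is well under that.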

nanozip supposedly uses BWT but compresses much better. Hard to speculate since it's closed source.

Here is the program that generates the 10 test files.

Code:
// Generate test files for prime number compression benchmark. Public domain.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

unsigned char* sieve;      // bit 1 = prime
const unsigned N=1000000;  // largest prime

// is i prime?
int prime(unsigned i) {
  return (sieve[i/8]>>(i&7))&1;
}

int main() {

  // init sieve
  sieve=(unsigned char*)malloc(N/8+1);
  if (!sieve) return 0;
  memset(sieve, 255, N/8+1);
  sieve[0]=0xfc;  // 0 and 1 are composite
  unsigned i, j;
  for (i=0; i*i<=N; ++i) {
    if (prime(i))
      for (j=i*2; j<N; j+=i)
        sieve[j/8] &= ~(1<<(j%8));
  }

  // List primes
  FILE* f1=fopen("text", "wb");     // decimal text, LF terminated
  FILE* f2=fopen("msb", "wb");      // 32 bit ints, MSB first
  FILE* f3=fopen("lsb", "wb");      // 32 bit ints, LSB first
  FILE* f4=fopen("delta", "wb");    // differences, MSB first
  FILE* f5=fopen("dbl", "wb");      // 32 bit ints twice each
  FILE* f6=fopen("twice", "wb");    // 2 copies of msb
  FILE* f7=fopen("hex", "wb");      // hex, no spaces
  FILE* f8=fopen("interlv", "wb");  // all MSB digits...all LSB digits
  FILE* f9=fopen("bytemap", "wb");  // '1' if prime else '0'
  FILE* f10=fopen("bitmap", "wb");  // packed bit map, LSB first
  j=0;  // previous prime
  for (i=0; i<N; ++i) {
    if (prime(i)) {
      fprintf(f1, "%u\n", i);
      fprintf(f2, "%c%c%c%c", i>>24, i>>16, i>>8, i);
      fprintf(f3, "%c%c%c%c", i, i>>8, i>>16, i>>24);
      j=i-j;  // difference from previous prime
      fprintf(f4, "%c%c%c%c", j>>24, j>>16, j>>8, j);
      fprintf(f5, "%c%c%c%c", i>>24, i>>16, i>>8, i);
      fprintf(f5, "%c%c%c%c", i>>24, i>>16, i>>8, i);
      fprintf(f6, "%c%c%c%c", i>>24, i>>16, i>>8, i);
      fprintf(f7, "%08X", i);
      fprintf(f8, "%c", i>>24);
      j=i;
    }
    fprintf(f9, "%c", '0'+prime(i));
  }
  fwrite(sieve, 1, N/8, f10);

  // second copy for twice, second byte plane for interlv
  for (i=0; i<N; ++i) {
    if (prime(i)) {
      fprintf(f6, "%c%c%c%c", i>>24, i>>16, i>>8, i);
      fprintf(f8, "%c", i>>16);
    }
  }

  // third and last byte planes for interlv
  for (i=0; i<N; ++i)
    if (prime(i))
      fprintf(f8, "%c", i>>8);
  for (i=0; i<N; ++i)
    if (prime(i))
      fprintf(f8, "%c", i);
  return 0;
}