I have been doing some experiments with small dictionary encoding to improve compression of text files. The idea is that the dictionary size is limited to 254 words so that each word can be encoded as 1 byte, plus one byte as an escape code and 1 byte to encode uppercase letters. Here are some examples with enwik8. enwik8.c is coded with capitalization modeling which usually works better. enwik8.e uses a dictionary that includes capitalized strings, which produces a smaller output because it emits fewer CAP symbols but usually compresses worse (with some exceptions like zp -m1 and -m2).
The improvement is similar to xwrt which uses a larger dictionary and multi-byte codes to produce an initially smaller file (about 55 MB). However I need more testing to confirm this because xwrt has lots of options.Code:100,000,000 enwik8 67,650,103 enwik8.c 66,549,413 enwik8.e zip -9 36,445,475 enwik8.zip 33,678,716 enwik8.c.zip 33,807,838 enwik8.e.zip 7zip 25,895,909 enwik8.7z 24,418,694 enwik8.c.7z 24,476,773 enwik8.e.7z bsc -b100 20,774,446 enwik8.bsc 20,526,925 enwik8.c.bsc 20,539,539 enwik8.e.bsc ppmd -m256 -o12 22,552,189 enwik8.pmd 21,736,040 enwik8.c.pmd 21,777,497 enwik8.e.pmd zp -m1 -b128 22,823,452 enwik8.zpaq 22,243,089 enwik8.c.zpaq 22,205,411 enwik8.e.zpaq zp -m2 -b128 21,246,043 enwik8.zpaq 21,110,738 enwik8.c.zpaq 21,104,654 enwik8.e.zpaq zp -m3 -b128 20,941,571 enwik8.zpaq 19,802,944 enwik8.c.zpaq 19,850,440 enwik8.e.zpaq zp -m4 -b128 19,448,663 enwik8.zpaq 19,303,662 enwik8.c.zpaq 19,355,944 enwik8.e.zpaq
Decoding is fast and simple. enwik8 decodes in about 1 second on my laptop. The encoded format is 2 bytes to indicate the ESC and CAP bytes followed by a 256 word dictionary (only 254 used) each starting with a length byte, and then the encoded text. ESC means the next byte should be output without decoding. CAP means the next byte to be output should be toggled between upper or lower case (change bit 5).
The tricky part is the encoding. I use byte pair encoding. The idea is to use the least frequent byte (after reserving CAP and ESC) to encode the most frequent byte pair. Repeat until no more space is saved. However this only works as a preprocessor if bytes of the same type are paired in order to not remove semantic boundaries. Thus, letters are paired with letters, digits with digits, white space with white space, and punctuation only with identical characters. To do this efficiently I read the first 2 GB of input into a 256K word hash table and compute the byte codes over that. With capitalization modeling (enwik8.c) capitalized words are stored in the hash table as lower case. This usually builds a dictionary containing only lower case strings. When the file is encoded in a second pass (using greedy parsing), capitalized words are encoded as a CAP before the first code.Code:// Decode file void decode(FILE* in, FILE* out) { enum {TEXT=256, CAP, ESC}; int i=0, j=0, c; int esc=getc(in); int cap=getc(in); unsigned char dict[256][256]; while ((c=getc(in))!=EOF) { if (i<TEXT) { // load dict dict[i][j]=c; if (j==dict[i][0]) j=0, ++i; else ++j; } else if (i==ESC) { putc(c, out); i=TEXT; } else if (c==esc) i=ESC; else if (c==cap) i=CAP; else { for (j=1; j<=dict[c][0]; ++j) { if (i==CAP) putc(dict[c][j]^32, out); else putc(dict[c][j], out); i=TEXT; } } }
Encoding doesn't work on binary files and doesn't work with compressors that already have a text model like nanozip, lpaq1, or hook. Some examples on the maximum compression corpus.
The program displays the dictionary it builds. Here is for enwik8 without capitalization modeling. Notice the uppercase entries in the dictionary. Encoding takes 35 seconds.Code:7zip 845,887 a10.jpg.7z 848,910 a10.jpg.c.7z 848,910 a10.jpg.e.7z 1,287,735 acrord32.exe.7z 1,508,506 acrord32.exe.c.7z 1,487,723 acrord32.exe.e.7z 851,622 english.dic.7z 726,483 english.dic.c.7z 726,483 english.dic.e.7z 3,710,068 FlashMX.pdf.7z 3,784,723 FlashMX.pdf.c.7z 3,733,423 FlashMX.pdf.e.7z 927,360 fp.log.7z 788,872 fp.log.c.7z 784,405 fp.log.e.7z 1,728,239 mso97.dll.7z 1,898,660 mso97.dll.c.7z 1,882,313 mso97.dll.e.7z 794,364 ohs.doc.7z 822,531 ohs.doc.c.7z 834,093 ohs.doc.e.7z 990,707 rafale.bmp.7z 1,064,101 rafale.bmp.c.7z 1,061,787 rafale.bmp.e.7z 621,793 vcfiu.hlp.7z 644,315 vcfiu.hlp.c.7z 643,044 vcfiu.hlp.e.7z 589,754 world95.txt.7z 554,668 world95.txt.c.7z 555,762 world95.txt.e.7z ppmonstr 829,637 a10.jpg.c.pmm 829,637 a10.jpg.e.pmm 829,379 a10.jpg.pmm 1,265,543 acrord32.exe.c.pmm 1,252,208 acrord32.exe.e.pmm 1,177,003 acrord32.exe.pmm 520,464 english.dic.c.pmm 520,464 english.dic.e.pmm 515,592 english.dic.pmm 3,652,935 FlashMX.pdf.c.pmm 3,638,840 FlashMX.pdf.e.pmm 3,636,095 FlashMX.pdf.pmm 412,842 fp.log.c.pmm 412,886 fp.log.e.pmm 467,317 fp.log.pmm 1,631,273 mso97.dll.c.pmm 1,621,471 mso97.dll.e.pmm 1,560,748 mso97.dll.pmm 743,480 ohs.doc.c.pmm 746,714 ohs.doc.e.pmm 738,400 ohs.doc.pmm 787,523 rafale.bmp.c.pmm 786,647 rafale.bmp.e.pmm 752,091 rafale.bmp.pmm 504,072 vcfiu.hlp.c.pmm 503,298 vcfiu.hlp.e.pmm 505,911 vcfiu.hlp.pmm 410,879 world95.txt.c.pmm 412,149 world95.txt.e.pmm 418,450 world95.txt.pmm
Here is with capitalization modeling. Notice the greater number of CAP symbols used.Code:C:\res\zpaq>wbpe e enwik8 x Pass 1, building dictionary........ Read first 100000000 characters. Parsed 97074017 bytes into 36460532 tokens. 252055 of 262144 hash table entries used. Assigned codes ESC=0 (count 0) CAP=1 (count 0) Byte pair encoding... 183 pairs encoded, 751062 escaped. Sorting dictionary... Pass 2: encoding... 99280821 -> 66071351 Code Count Meaning --- --------- ------- 0 772523 ESC 1 1349723 CAP 2 562584 "^J" 3 206250 "^J^J" 4 52954 "^J " 5 100222 "^J " 6 12348864 " " 7 178906 " " 8 46190 "#" 9 479245 "&" 10 102283 "'" 11 287570 "''" 12 89695 "'''" 13 221072 "(" 14 225748 ")" 15 235458 "*" 16 787826 "," 17 326098 "-" 18 794548 "." 19 325049 "/" 20 51130 "//" 21 264686 "0" 22 298296 "1" 23 126841 "19" 24 210974 "2" 25 55723 "200" 26 156663 "3" 27 145926 "4" 28 157908 "5" 29 134227 "6" 30 121640 "7" 31 146017 "8" 32 120162 "9" 33 333366 ":" 34 572842 ";" 35 267136 "<" 36 146194 "=" 37 127359 "==" 38 267136 ">" 39 129706 "A" 40 141510 "B" 41 217587 "C" 42 53177 "Ch" 43 133479 "D" 44 88955 "E" 45 128364 "F" 46 84959 "G" 47 101771 "H" 48 92031 "I" 49 78529 "In" 50 76029 "J" 51 57854 "K" 52 101687 "L" 53 144435 "M" 54 115721 "N" 55 35851 "O" 56 119690 "P" 57 73542 "R" 58 139377 "S" 59 93851 "T" 60 135728 "The" 61 33168 "U" 62 43093 "V" 63 82993 "W" 64 65444 "[" 65 945578 "[[" 66 64457 "]" 67 945639 "]]" 68 49420 "_" 69 705101 "a" 70 110517 "ab" 71 178254 "ac" 72 64957 "act" 73 175677 "ad" 74 59780 "ag" 75 107992 "age" 76 77399 "ain" 77 48120 "ak" 78 449968 "al" 79 125542 "all" 80 103561 "am" 81 107934 "ame" 82 111213 "amp" 83 469017 "an" 84 420045 "and" 85 68130 "ant" 86 110914 "ap" 87 331980 "ar" 88 49785 "ard" 89 102411 "are" 90 79940 "art" 91 43587 "ary" 92 380969 "as" 93 310288 "at" 94 185723 "ate" 95 149248 "ation" 96 39578 "av" 97 52828 "ave" 98 100666 "ay" 99 429849 "b" 100 194678 "be" 101 57891 "ber" 102 103274 "br" 103 96377 "by" 104 610897 "c" 105 258937 "ce" 106 192453 "ch" 107 85905 "cl" 108 150646 "com" 109 164572 "con" 110 58156 "ct" 111 666834 "d" 112 248455 "de" 113 1221736 "e" 114 53924 "ear" 115 73479 "ect" 116 410292 "ed" 117 141593 "el" 118 31384 "ell" 119 124281 "em" 120 262246 "en" 121 29655 "end" 122 129963 "ent" 123 308917 "er" 124 75063 "ere" 125 51855 "ers" 126 216064 "es" 127 38665 "ess" 128 83341 "est" 129 126863 "et" 130 57154 "ew" 131 72890 "ex" 132 53959 "ext" 133 544972 "f" 134 176213 "for" 135 57633 "from" 136 425929 "g" 137 163759 "ge" 138 95324 "gr" 139 116900 "gt" 140 614583 "h" 141 224246 "he" 142 69138 "his" 143 50893 "http" 144 417147 "i" 145 110838 "ia" 146 55903 "ial" 147 71091 "ian" 148 78755 "ib" 149 356898 "ic" 150 56126 "ical" 151 218094 "id" 152 74556 "ies" 153 79288 "if" 154 70348 "ig" 155 46693 "ight" 156 42628 "ign" 157 172671 "il" 158 60393 "ill" 159 137326 "im" 160 668278 "in" 161 78927 "ine" 162 340330 "ing" 163 242601 "ion" 164 61788 "ip" 165 168408 "ir" 166 420521 "is" 167 60869 "ish" 168 105358 "ist" 169 420467 "it" 170 67697 "ity" 171 44490 "iv" 172 102090 "ive" 173 83366 "j" 174 353018 "k" 175 621811 "l" 176 50736 "ld" 177 321022 "le" 178 126239 "lt" 179 128534 "ly" 180 644190 "m" 181 80022 "man" 182 69160 "ment" 183 168774 "mo" 184 870788 "n" 185 55594 "not" 186 496104 "o" 187 59937 "oc" 188 73085 "od" 189 523225 "of" 190 56748 "og" 191 206809 "ol" 192 82871 "om" 193 255716 "on" 194 57031 "one" 195 119217 "op" 196 366314 "or" 197 43529 "ord" 198 59519 "ort" 199 45661 "ory" 200 116251 "os" 201 114537 "ot" 202 161132 "ou" 203 63097 "oun" 204 34986 "our" 205 128946 "ow" 206 564530 "p" 207 83204 "pe" 208 67624 "per" 209 104141 "pl" 210 116408 "pr" 211 104037 "pro" 212 66977 "qu" 213 193734 "quot" 214 731790 "r" 215 423035 "re" 216 132042 "ro" 217 1444579 "s" 218 255455 "se" 219 72496 "sh" 220 189288 "so" 221 110658 "sp" 222 272465 "st" 223 44263 "str" 224 144107 "su" 225 871887 "t" 226 128472 "ter" 227 190671 "th" 228 85800 "that" 229 804295 "the" 230 32295 "ther" 231 53983 "tim" 232 381179 "to" 233 172376 "tr" 234 380651 "u" 235 50291 "ud" 236 111637 "ul" 237 93571 "um" 238 169094 "un" 239 203237 "ur" 240 227602 "us" 241 162596 "ut" 242 344288 "v" 243 113275 "ver" 244 387043 "w" 245 86821 "was" 246 82004 "wh" 247 42847 "which" 248 85852 "with" 249 39633 "ww" 250 107366 "x" 251 606459 "y" 252 124570 "z" 253 39763 "{{" 254 453312 "|" 255 40238 "}}" 64426387 strings encoded 100000000 -> 66549413
Source code (GPL) and Win32 .exeCode:C:\res\zpaq>wbpe c enwik8 x Pass 1, building dictionary........ Read first 100000000 characters. Parsed 97857266 bytes into 36565535 tokens. 245232 of 262144 hash table entries used. Assigned codes ESC=0 (count 0) CAP=1 (count 0) Byte pair encoding... 201 pairs encoded, 1084170 escaped. Sorting dictionary... Pass 2: encoding... 99047898 -> 67008125 Code Count Meaning --- --------- ------- 0 772523 ESC 1 3359921 CAP 2 562584 "^J" 3 206250 "^J^J" 4 52954 "^J " 5 100222 "^J " 6 12348864 " " 7 178906 " " 8 46190 "#" 9 479245 "&" 10 102283 "'" 11 287570 "''" 12 89695 "'''" 13 221072 "(" 14 225748 ")" 15 235458 "*" 16 787826 "," 17 326098 "-" 18 794548 "." 19 325049 "/" 20 51130 "//" 21 264686 "0" 22 298296 "1" 23 126841 "19" 24 210974 "2" 25 55723 "200" 26 156663 "3" 27 145926 "4" 28 157908 "5" 29 134227 "6" 30 121640 "7" 31 146017 "8" 32 120162 "9" 33 333366 ":" 34 572842 ";" 35 267136 "<" 36 146194 "=" 37 127359 "==" 38 267136 ">" 39 99076 "A" 40 162559 "C" 41 91877 "I" 42 139377 "S" 43 65444 "[" 44 945578 "[[" 45 64457 "]" 46 945639 "]]" 47 49420 "_" 48 622778 "a" 49 110123 "ab" 50 169531 "ac" 51 64934 "act" 52 164055 "ad" 53 46671 "af" 54 52188 "ag" 55 107512 "age" 56 58601 "ain" 57 438212 "al" 58 111491 "all" 59 102603 "am" 60 107642 "ame" 61 111211 "amp" 62 468559 "an" 63 420042 "and" 64 68162 "ant" 65 106889 "ap" 66 303135 "ar" 67 49715 "ard" 68 101926 "are" 69 76713 "art" 70 41233 "ary" 71 369990 "as" 72 272083 "at" 73 144682 "ate" 74 35725 "ated" 75 141460 "ation" 76 81877 "au" 77 39551 "av" 78 52668 "ave" 79 83408 "ay" 80 408686 "b" 81 194719 "be" 82 57891 "ber" 83 63093 "bl" 84 99930 "bo" 85 103277 "br" 86 96377 "by" 87 442428 "c" 88 221864 "ce" 89 41350 "cent" 90 252615 "ch" 91 86529 "cl" 92 148011 "co" 93 42220 "col" 94 102984 "com" 95 48096 "comp" 96 164793 "con" 97 82703 "cr" 98 138340 "ct" 99 855872 "d" 100 244571 "de" 101 55783 "der" 102 62946 "du" 103 1157749 "e" 104 51927 "ear" 105 331298 "ed" 106 107253 "el" 107 12352 "ell" 108 114775 "em" 109 214216 "en" 110 21114 "end" 111 101071 "ent" 112 258176 "er" 113 44641 "ere" 114 40312 "ers" 115 161112 "es" 116 17960 "ess" 117 59694 "est" 118 83987 "et" 119 67265 "ex" 120 18272 "ext" 121 560200 "f" 122 176070 "for" 123 67518 "fr" 124 57630 "from" 125 501453 "g" 126 166424 "ge" 127 104595 "gr" 128 116998 "gt" 129 62217 "gu" 130 629229 "h" 131 210843 "he" 132 67559 "his" 133 50893 "http" 134 276990 "i" 135 110838 "ia" 136 55903 "ial" 137 71091 "ian" 138 78755 "ib" 139 356898 "ic" 140 56126 "ical" 141 218094 "id" 142 140311 "ie" 143 74556 "ies" 144 79288 "if" 145 70348 "ig" 146 46693 "ight" 147 42628 "ign" 148 172671 "il" 149 60393 "ill" 150 137491 "im" 151 763833 "in" 152 80695 "ine" 153 340334 "ing" 154 250389 "ion" 155 61788 "ip" 156 168408 "ir" 157 374131 "is" 158 60881 "ish" 159 105773 "ist" 160 420457 "it" 161 67697 "ity" 162 44490 "iv" 163 102090 "ive" 164 159395 "j" 165 361765 "k" 166 97227 "ke" 167 668899 "l" 168 53014 "ld" 169 328743 "le" 170 107760 "lo" 171 121829 "lt" 172 132796 "ly" 173 577372 "m" 174 227923 "ma" 175 80451 "man" 176 71373 "ment" 177 170128 "mo" 178 891134 "n" 179 168242 "ne" 180 55550 "not" 181 528365 "o" 182 57728 "od" 183 522318 "of" 184 160243 "ol" 185 55482 "om" 186 20516 "ome" 187 234676 "on" 188 55724 "one" 189 110545 "op" 190 305918 "or" 191 57598 "ort" 192 45292 "ory" 193 102961 "os" 194 92314 "ot" 195 146244 "ou" 196 35202 "oun" 197 25813 "our" 198 103786 "ow" 199 616438 "p" 200 83723 "pe" 201 67708 "per" 202 51940 "ph" 203 93803 "pl" 204 67261 "pr" 205 46974 "pres" 206 103771 "pro" 207 66977 "qu" 208 193734 "quot" 209 837011 "r" 210 325434 "re" 211 47700 "res" 212 36751 "rev" 213 120999 "ro" 214 1502578 "s" 215 266089 "se" 216 72956 "sh" 217 186438 "so" 218 103277 "sp" 219 307964 "st" 220 54180 "str" 221 138102 "su" 222 900617 "t" 223 186038 "te" 224 118197 "ter" 225 178076 "th" 226 85801 "that" 227 944779 "the" 228 33973 "ther" 229 47542 "this" 230 53818 "tim" 231 383417 "to" 232 156639 "tr" 233 367482 "u" 234 111854 "ul" 235 90105 "um" 236 188192 "un" 237 192422 "ur" 238 190448 "us" 239 142265 "ut" 240 352798 "v" 241 111294 "ver" 242 340997 "w" 243 86849 "was" 244 118813 "we" 245 83322 "wh" 246 42847 "which" 247 85862 "with" 248 52022 "wor" 249 39476 "www" 250 148678 "x" 251 622178 "y" 252 124570 "z" 253 39763 "{{" 254 453312 "|" 255 40238 "}}" 63516847 strings encoded 100000000 -> 67650103
http://mattmahoney.net/dc/wbpe100.zip