
Thread: Small dictionary preprocessing for text files

  1. #1
    Matt Mahoney (Expert)

    Small dictionary preprocessing for text files

    I have been doing some experiments with small dictionary encoding to improve compression of text files. The idea is that the dictionary size is limited to 254 words so that each word can be encoded as one byte, with one byte value reserved as an escape code and one reserved to mark uppercase letters. Here are some examples with enwik8. enwik8.c is coded with capitalization modeling, which usually works better. enwik8.e uses a dictionary that includes capitalized strings, which produces a smaller intermediate output because it emits fewer CAP symbols, but usually compresses worse (with some exceptions like zp -m1 and -m2).

    Code:
    100,000,000 enwik8
     67,650,103 enwik8.c
     66,549,413 enwik8.e
    
    zip -9
    36,445,475 enwik8.zip
    33,678,716 enwik8.c.zip
    33,807,838 enwik8.e.zip
    
    7zip
    25,895,909 enwik8.7z
    24,418,694 enwik8.c.7z
    24,476,773 enwik8.e.7z
    
    bsc -b100
    20,774,446 enwik8.bsc
    20,526,925 enwik8.c.bsc
    20,539,539 enwik8.e.bsc
    
    ppmd -m256 -o12
    22,552,189 enwik8.pmd
    21,736,040 enwik8.c.pmd
    21,777,497 enwik8.e.pmd
    
    zp -m1 -b128
    22,823,452 enwik8.zpaq
    22,243,089 enwik8.c.zpaq
    22,205,411 enwik8.e.zpaq
    
    zp -m2 -b128
    21,246,043 enwik8.zpaq
    21,110,738 enwik8.c.zpaq
    21,104,654 enwik8.e.zpaq
    
    zp -m3 -b128
    20,941,571 enwik8.zpaq
    19,802,944 enwik8.c.zpaq
    19,850,440 enwik8.e.zpaq
    
    zp -m4 -b128
    19,448,663 enwik8.zpaq
    19,303,662 enwik8.c.zpaq
    19,355,944 enwik8.e.zpaq
    The improvement is similar to xwrt, which uses a larger dictionary and multi-byte codes to produce an initially smaller file (about 55 MB). However, I need more testing to confirm this because xwrt has lots of options.

    Decoding is fast and simple. enwik8 decodes in about 1 second on my laptop. The encoded format is 2 header bytes giving the byte values used for ESC and CAP, followed by a 256-entry dictionary (only 254 entries are used), each entry starting with a length byte, and then the encoded text. ESC means the next byte should be output without decoding. CAP means the next byte to be output should be toggled between upper and lower case (flip bit 5).

    Code:
    // Decode file
    void decode(FILE* in, FILE* out) {
      enum {TEXT=256, CAP, ESC};     // decoder states; i counts dict entries 0..255 while loading
      int i=0, j=0, c;
      int esc=getc(in);              // header: byte value assigned to ESC
      int cap=getc(in);              // header: byte value assigned to CAP
      unsigned char dict[256][256];  // 256 length-prefixed dictionary entries
      while ((c=getc(in))!=EOF) {
        if (i<TEXT) {  // load dict
          dict[i][j]=c;
          if (j==dict[i][0]) j=0, ++i;
          else ++j;
        }
        else if (i==ESC) {           // previous byte was ESC: output literally
          putc(c, out);
          i=TEXT;
        }
        else if (c==esc) i=ESC;
        else if (c==cap) i=CAP;
        else {                       // dictionary code; after CAP, toggle case of first char
          for (j=1; j<=dict[c][0]; ++j) {
            if (i==CAP) putc(dict[c][j]^32, out);
            else putc(dict[c][j], out);
            i=TEXT;
          }
        }
      }
    }
    The tricky part is the encoding. I use byte pair encoding. The idea is to use the least frequent byte (after reserving CAP and ESC) to encode the most frequent byte pair, and to repeat until no more space is saved. However, this only works as a preprocessor if bytes of the same type are paired, in order not to remove semantic boundaries. Thus, letters are paired with letters, digits with digits, white space with white space, and punctuation only with identical characters. To do this efficiently I read the first 2 GB of input into a 256K-entry word hash table and compute the byte pair codes over that. With capitalization modeling (enwik8.c), capitalized words are stored in the hash table as lower case. This usually builds a dictionary containing only lower case strings. When the file is encoded in a second pass (using greedy parsing), capitalized words are encoded as a CAP before the first code.
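
    As a rough sketch of that pairing rule, here is what a single byte pair encoding step could look like over an in-memory buffer. This is illustration only, not the wbpe encoder (which works on the token hash table, escapes existing uses of the chosen byte, and repeats until no more space is saved); the function names and the simple profitability test are assumptions.

    Code:
    // One byte-pair-encoding step with the type constraint described above.
    #include <stdio.h>
    #include <string.h>
    #include <ctype.h>

    // Classify a byte so that only bytes of the same type may be paired:
    // letters with letters, digits with digits, white space with white
    // space, punctuation only with an identical character.
    static int pair_type(int c) {
      if (isalpha(c)) return 256;
      if (isdigit(c)) return 257;
      if (isspace(c)) return 258;
      return c;                          // punctuation pairs only with itself
    }

    // Replace the most frequent same-type pair with the least frequent byte
    // (codes 0 and 1 assumed reserved for ESC and CAP).  Returns the new
    // length, or len if no replacement looks profitable.
    size_t bpe_step(unsigned char* buf, size_t len) {
      static long pair[256][256];
      long cnt[256] = {0};
      memset(pair, 0, sizeof pair);
      for (size_t i = 0; i < len; ++i) {
        ++cnt[buf[i]];
        if (i+1 < len && pair_type(buf[i]) == pair_type(buf[i+1]))
          ++pair[buf[i]][buf[i+1]];
      }
      int code = 2, a = 0, b = 0;        // least frequent byte, best pair
      for (int c = 3; c < 256; ++c) if (cnt[c] < cnt[code]) code = c;
      for (int i = 0; i < 256; ++i)
        for (int j = 0; j < 256; ++j)
          if (pair[i][j] > pair[a][b]) a = i, b = j;
      // Each replaced pair saves one byte; each existing use of 'code' would
      // need an ESC (not done here), so require more pairs than old uses.
      if (pair[a][b] <= cnt[code]) return len;
      size_t out = 0;
      for (size_t i = 0; i < len; ++i) {
        if (i+1 < len && buf[i] == a && buf[i+1] == b) {
          buf[out++] = (unsigned char)code;
          ++i;
        }
        else buf[out++] = buf[i];
      }
      return out;
    }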

    The preprocessing doesn't help on binary files, and it doesn't help with compressors that already have a text model, like nanozip, lpaq1, or hook. Here are some examples on the Maximum Compression corpus.

    Code:
    7zip
      845,887 a10.jpg.7z
      848,910 a10.jpg.c.7z
      848,910 a10.jpg.e.7z
    1,287,735 acrord32.exe.7z
    1,508,506 acrord32.exe.c.7z
    1,487,723 acrord32.exe.e.7z
      851,622 english.dic.7z
      726,483 english.dic.c.7z
      726,483 english.dic.e.7z
    3,710,068 FlashMX.pdf.7z
    3,784,723 FlashMX.pdf.c.7z
    3,733,423 FlashMX.pdf.e.7z
      927,360 fp.log.7z
      788,872 fp.log.c.7z
      784,405 fp.log.e.7z
    1,728,239 mso97.dll.7z
    1,898,660 mso97.dll.c.7z
    1,882,313 mso97.dll.e.7z
      794,364 ohs.doc.7z
      822,531 ohs.doc.c.7z
      834,093 ohs.doc.e.7z
      990,707 rafale.bmp.7z
    1,064,101 rafale.bmp.c.7z
    1,061,787 rafale.bmp.e.7z
      621,793 vcfiu.hlp.7z
      644,315 vcfiu.hlp.c.7z
      643,044 vcfiu.hlp.e.7z
      589,754 world95.txt.7z
      554,668 world95.txt.c.7z
      555,762 world95.txt.e.7z
    
    ppmonstr
      829,637 a10.jpg.c.pmm
      829,637 a10.jpg.e.pmm
      829,379 a10.jpg.pmm
    1,265,543 acrord32.exe.c.pmm
    1,252,208 acrord32.exe.e.pmm
    1,177,003 acrord32.exe.pmm
      520,464 english.dic.c.pmm
      520,464 english.dic.e.pmm
      515,592 english.dic.pmm
    3,652,935 FlashMX.pdf.c.pmm
    3,638,840 FlashMX.pdf.e.pmm
    3,636,095 FlashMX.pdf.pmm
      412,842 fp.log.c.pmm
      412,886 fp.log.e.pmm
      467,317 fp.log.pmm
    1,631,273 mso97.dll.c.pmm
    1,621,471 mso97.dll.e.pmm
    1,560,748 mso97.dll.pmm
      743,480 ohs.doc.c.pmm
      746,714 ohs.doc.e.pmm
      738,400 ohs.doc.pmm
      787,523 rafale.bmp.c.pmm
      786,647 rafale.bmp.e.pmm
      752,091 rafale.bmp.pmm
      504,072 vcfiu.hlp.c.pmm
      503,298 vcfiu.hlp.e.pmm
      505,911 vcfiu.hlp.pmm
      410,879 world95.txt.c.pmm
      412,149 world95.txt.e.pmm
      418,450 world95.txt.pmm
    The program displays the dictionary it builds. Here it is for enwik8 without capitalization modeling. Notice the uppercase entries in the dictionary. Encoding takes 35 seconds.

    Code:
    C:\res\zpaq>wbpe e enwik8 x
    Pass 1, building dictionary........
    Read first 100000000 characters.
    Parsed 97074017 bytes into 36460532 tokens.
    252055 of 262144 hash table entries used.
    Assigned codes ESC=0 (count 0) CAP=1 (count 0)
    Byte pair encoding...
    183 pairs encoded, 751062 escaped. Sorting dictionary...
    Pass 2: encoding...
    99280821 -> 66071351
    
    Code   Count   Meaning
    ---  --------- -------
      0     772523 ESC
      1    1349723 CAP
      2     562584 "^J"
      3     206250 "^J^J"
      4      52954 "^J    "
      5     100222 "^J      "
      6   12348864 " "
      7     178906 "  "
      8      46190 "#"
      9     479245 "&"
     10     102283 "'"
     11     287570 "''"
     12      89695 "'''"
     13     221072 "("
     14     225748 ")"
     15     235458 "*"
     16     787826 ","
     17     326098 "-"
     18     794548 "."
     19     325049 "/"
     20      51130 "//"
     21     264686 "0"
     22     298296 "1"
     23     126841 "19"
     24     210974 "2"
     25      55723 "200"
     26     156663 "3"
     27     145926 "4"
     28     157908 "5"
     29     134227 "6"
     30     121640 "7"
     31     146017 "8"
     32     120162 "9"
     33     333366 ":"
     34     572842 ";"
     35     267136 "<"
     36     146194 "="
     37     127359 "=="
     38     267136 ">"
     39     129706 "A"
     40     141510 "B"
     41     217587 "C"
     42      53177 "Ch"
     43     133479 "D"
     44      88955 "E"
     45     128364 "F"
     46      84959 "G"
     47     101771 "H"
     48      92031 "I"
     49      78529 "In"
     50      76029 "J"
     51      57854 "K"
     52     101687 "L"
     53     144435 "M"
     54     115721 "N"
     55      35851 "O"
     56     119690 "P"
     57      73542 "R"
     58     139377 "S"
     59      93851 "T"
     60     135728 "The"
     61      33168 "U"
     62      43093 "V"
     63      82993 "W"
     64      65444 "["
     65     945578 "[["
     66      64457 "]"
     67     945639 "]]"
     68      49420 "_"
     69     705101 "a"
     70     110517 "ab"
     71     178254 "ac"
     72      64957 "act"
     73     175677 "ad"
     74      59780 "ag"
     75     107992 "age"
     76      77399 "ain"
     77      48120 "ak"
     78     449968 "al"
     79     125542 "all"
     80     103561 "am"
     81     107934 "ame"
     82     111213 "amp"
     83     469017 "an"
     84     420045 "and"
     85      68130 "ant"
     86     110914 "ap"
     87     331980 "ar"
     88      49785 "ard"
     89     102411 "are"
     90      79940 "art"
     91      43587 "ary"
     92     380969 "as"
     93     310288 "at"
     94     185723 "ate"
     95     149248 "ation"
     96      39578 "av"
     97      52828 "ave"
     98     100666 "ay"
     99     429849 "b"
    100     194678 "be"
    101      57891 "ber"
    102     103274 "br"
    103      96377 "by"
    104     610897 "c"
    105     258937 "ce"
    106     192453 "ch"
    107      85905 "cl"
    108     150646 "com"
    109     164572 "con"
    110      58156 "ct"
    111     666834 "d"
    112     248455 "de"
    113    1221736 "e"
    114      53924 "ear"
    115      73479 "ect"
    116     410292 "ed"
    117     141593 "el"
    118      31384 "ell"
    119     124281 "em"
    120     262246 "en"
    121      29655 "end"
    122     129963 "ent"
    123     308917 "er"
    124      75063 "ere"
    125      51855 "ers"
    126     216064 "es"
    127      38665 "ess"
    128      83341 "est"
    129     126863 "et"
    130      57154 "ew"
    131      72890 "ex"
    132      53959 "ext"
    133     544972 "f"
    134     176213 "for"
    135      57633 "from"
    136     425929 "g"
    137     163759 "ge"
    138      95324 "gr"
    139     116900 "gt"
    140     614583 "h"
    141     224246 "he"
    142      69138 "his"
    143      50893 "http"
    144     417147 "i"
    145     110838 "ia"
    146      55903 "ial"
    147      71091 "ian"
    148      78755 "ib"
    149     356898 "ic"
    150      56126 "ical"
    151     218094 "id"
    152      74556 "ies"
    153      79288 "if"
    154      70348 "ig"
    155      46693 "ight"
    156      42628 "ign"
    157     172671 "il"
    158      60393 "ill"
    159     137326 "im"
    160     668278 "in"
    161      78927 "ine"
    162     340330 "ing"
    163     242601 "ion"
    164      61788 "ip"
    165     168408 "ir"
    166     420521 "is"
    167      60869 "ish"
    168     105358 "ist"
    169     420467 "it"
    170      67697 "ity"
    171      44490 "iv"
    172     102090 "ive"
    173      83366 "j"
    174     353018 "k"
    175     621811 "l"
    176      50736 "ld"
    177     321022 "le"
    178     126239 "lt"
    179     128534 "ly"
    180     644190 "m"
    181      80022 "man"
    182      69160 "ment"
    183     168774 "mo"
    184     870788 "n"
    185      55594 "not"
    186     496104 "o"
    187      59937 "oc"
    188      73085 "od"
    189     523225 "of"
    190      56748 "og"
    191     206809 "ol"
    192      82871 "om"
    193     255716 "on"
    194      57031 "one"
    195     119217 "op"
    196     366314 "or"
    197      43529 "ord"
    198      59519 "ort"
    199      45661 "ory"
    200     116251 "os"
    201     114537 "ot"
    202     161132 "ou"
    203      63097 "oun"
    204      34986 "our"
    205     128946 "ow"
    206     564530 "p"
    207      83204 "pe"
    208      67624 "per"
    209     104141 "pl"
    210     116408 "pr"
    211     104037 "pro"
    212      66977 "qu"
    213     193734 "quot"
    214     731790 "r"
    215     423035 "re"
    216     132042 "ro"
    217    1444579 "s"
    218     255455 "se"
    219      72496 "sh"
    220     189288 "so"
    221     110658 "sp"
    222     272465 "st"
    223      44263 "str"
    224     144107 "su"
    225     871887 "t"
    226     128472 "ter"
    227     190671 "th"
    228      85800 "that"
    229     804295 "the"
    230      32295 "ther"
    231      53983 "tim"
    232     381179 "to"
    233     172376 "tr"
    234     380651 "u"
    235      50291 "ud"
    236     111637 "ul"
    237      93571 "um"
    238     169094 "un"
    239     203237 "ur"
    240     227602 "us"
    241     162596 "ut"
    242     344288 "v"
    243     113275 "ver"
    244     387043 "w"
    245      86821 "was"
    246      82004 "wh"
    247      42847 "which"
    248      85852 "with"
    249      39633 "ww"
    250     107366 "x"
    251     606459 "y"
    252     124570 "z"
    253      39763 "{{"
    254     453312 "|"
    255      40238 "}}"
    
    64426387 strings encoded
    100000000 -> 66549413
    Here it is with capitalization modeling. Notice the greater number of CAP symbols used.

    Code:
    C:\res\zpaq>wbpe c enwik8 x
    Pass 1, building dictionary........
    Read first 100000000 characters.
    Parsed 97857266 bytes into 36565535 tokens.
    245232 of 262144 hash table entries used.
    Assigned codes ESC=0 (count 0) CAP=1 (count 0)
    Byte pair encoding...
    201 pairs encoded, 1084170 escaped. Sorting dictionary...
    Pass 2: encoding...
    99047898 -> 67008125
    
    Code   Count   Meaning
    ---  --------- -------
      0     772523 ESC
      1    3359921 CAP
      2     562584 "^J"
      3     206250 "^J^J"
      4      52954 "^J    "
      5     100222 "^J      "
      6   12348864 " "
      7     178906 "  "
      8      46190 "#"
      9     479245 "&"
     10     102283 "'"
     11     287570 "''"
     12      89695 "'''"
     13     221072 "("
     14     225748 ")"
     15     235458 "*"
     16     787826 ","
     17     326098 "-"
     18     794548 "."
     19     325049 "/"
     20      51130 "//"
     21     264686 "0"
     22     298296 "1"
     23     126841 "19"
     24     210974 "2"
     25      55723 "200"
     26     156663 "3"
     27     145926 "4"
     28     157908 "5"
     29     134227 "6"
     30     121640 "7"
     31     146017 "8"
     32     120162 "9"
     33     333366 ":"
     34     572842 ";"
     35     267136 "<"
     36     146194 "="
     37     127359 "=="
     38     267136 ">"
     39      99076 "A"
     40     162559 "C"
     41      91877 "I"
     42     139377 "S"
     43      65444 "["
     44     945578 "[["
     45      64457 "]"
     46     945639 "]]"
     47      49420 "_"
     48     622778 "a"
     49     110123 "ab"
     50     169531 "ac"
     51      64934 "act"
     52     164055 "ad"
     53      46671 "af"
     54      52188 "ag"
     55     107512 "age"
     56      58601 "ain"
     57     438212 "al"
     58     111491 "all"
     59     102603 "am"
     60     107642 "ame"
     61     111211 "amp"
     62     468559 "an"
     63     420042 "and"
     64      68162 "ant"
     65     106889 "ap"
     66     303135 "ar"
     67      49715 "ard"
     68     101926 "are"
     69      76713 "art"
     70      41233 "ary"
     71     369990 "as"
     72     272083 "at"
     73     144682 "ate"
     74      35725 "ated"
     75     141460 "ation"
     76      81877 "au"
     77      39551 "av"
     78      52668 "ave"
     79      83408 "ay"
     80     408686 "b"
     81     194719 "be"
     82      57891 "ber"
     83      63093 "bl"
     84      99930 "bo"
     85     103277 "br"
     86      96377 "by"
     87     442428 "c"
     88     221864 "ce"
     89      41350 "cent"
     90     252615 "ch"
     91      86529 "cl"
     92     148011 "co"
     93      42220 "col"
     94     102984 "com"
     95      48096 "comp"
     96     164793 "con"
     97      82703 "cr"
     98     138340 "ct"
     99     855872 "d"
    100     244571 "de"
    101      55783 "der"
    102      62946 "du"
    103    1157749 "e"
    104      51927 "ear"
    105     331298 "ed"
    106     107253 "el"
    107      12352 "ell"
    108     114775 "em"
    109     214216 "en"
    110      21114 "end"
    111     101071 "ent"
    112     258176 "er"
    113      44641 "ere"
    114      40312 "ers"
    115     161112 "es"
    116      17960 "ess"
    117      59694 "est"
    118      83987 "et"
    119      67265 "ex"
    120      18272 "ext"
    121     560200 "f"
    122     176070 "for"
    123      67518 "fr"
    124      57630 "from"
    125     501453 "g"
    126     166424 "ge"
    127     104595 "gr"
    128     116998 "gt"
    129      62217 "gu"
    130     629229 "h"
    131     210843 "he"
    132      67559 "his"
    133      50893 "http"
    134     276990 "i"
    135     110838 "ia"
    136      55903 "ial"
    137      71091 "ian"
    138      78755 "ib"
    139     356898 "ic"
    140      56126 "ical"
    141     218094 "id"
    142     140311 "ie"
    143      74556 "ies"
    144      79288 "if"
    145      70348 "ig"
    146      46693 "ight"
    147      42628 "ign"
    148     172671 "il"
    149      60393 "ill"
    150     137491 "im"
    151     763833 "in"
    152      80695 "ine"
    153     340334 "ing"
    154     250389 "ion"
    155      61788 "ip"
    156     168408 "ir"
    157     374131 "is"
    158      60881 "ish"
    159     105773 "ist"
    160     420457 "it"
    161      67697 "ity"
    162      44490 "iv"
    163     102090 "ive"
    164     159395 "j"
    165     361765 "k"
    166      97227 "ke"
    167     668899 "l"
    168      53014 "ld"
    169     328743 "le"
    170     107760 "lo"
    171     121829 "lt"
    172     132796 "ly"
    173     577372 "m"
    174     227923 "ma"
    175      80451 "man"
    176      71373 "ment"
    177     170128 "mo"
    178     891134 "n"
    179     168242 "ne"
    180      55550 "not"
    181     528365 "o"
    182      57728 "od"
    183     522318 "of"
    184     160243 "ol"
    185      55482 "om"
    186      20516 "ome"
    187     234676 "on"
    188      55724 "one"
    189     110545 "op"
    190     305918 "or"
    191      57598 "ort"
    192      45292 "ory"
    193     102961 "os"
    194      92314 "ot"
    195     146244 "ou"
    196      35202 "oun"
    197      25813 "our"
    198     103786 "ow"
    199     616438 "p"
    200      83723 "pe"
    201      67708 "per"
    202      51940 "ph"
    203      93803 "pl"
    204      67261 "pr"
    205      46974 "pres"
    206     103771 "pro"
    207      66977 "qu"
    208     193734 "quot"
    209     837011 "r"
    210     325434 "re"
    211      47700 "res"
    212      36751 "rev"
    213     120999 "ro"
    214    1502578 "s"
    215     266089 "se"
    216      72956 "sh"
    217     186438 "so"
    218     103277 "sp"
    219     307964 "st"
    220      54180 "str"
    221     138102 "su"
    222     900617 "t"
    223     186038 "te"
    224     118197 "ter"
    225     178076 "th"
    226      85801 "that"
    227     944779 "the"
    228      33973 "ther"
    229      47542 "this"
    230      53818 "tim"
    231     383417 "to"
    232     156639 "tr"
    233     367482 "u"
    234     111854 "ul"
    235      90105 "um"
    236     188192 "un"
    237     192422 "ur"
    238     190448 "us"
    239     142265 "ut"
    240     352798 "v"
    241     111294 "ver"
    242     340997 "w"
    243      86849 "was"
    244     118813 "we"
    245      83322 "wh"
    246      42847 "which"
    247      85862 "with"
    248      52022 "wor"
    249      39476 "www"
    250     148678 "x"
    251     622178 "y"
    252     124570 "z"
    253      39763 "{{"
    254     453312 "|"
    255      40238 "}}"
    
    63516847 strings encoded
    100000000 -> 67650103
    Source code (GPL) and Win32 .exe:
    http://mattmahoney.net/dc/wbpe100.zip

  2. #2
    Matt Mahoney (Expert)
    Unfortunately xwrt outperforms wbpe on enwik9. I may try some experiments with larger dictionaries.

    Code:
    wbpe c (v1.00)
    673,370,615 enwik9.c
    
    xwrt -2 -b255 -m255 -s -f64 (v3.2)
    561,080,844 enwik9.xwrt
    
    ppmonstr e -o10 -m1650
    158,462,523 enwik9.pmm
    149,683,690 enwik9.c.pmm
    148,915,761 enwik9.xwrt.pmm
    
    ppmonstr e -o16 -m1700
    157,007,383 enwik9.pmm
    149,706,020 enwik9.c.pmm
    149,192,045 enwik9.xwrt.pmm
    
    ppmd e -m256 -o10 -r1
    183,964,915 enwik9.ppm
    176,631,110 enwik9.c.ppm
    174,879,256 enwik9.xwrt.ppm
    
    zp -m4 -b1024
    165,887,518 enwik9.zpaq
    163,802,788 enwik9.c.zpaq
    161,140,622 enwik9.xwrt.zpaq
    
    bsc -b328 -T
    171,820,075 enwik9.bsc
    167,027,710 enwik9.c.bsc
    167,889,823 enwik9.xwrt.bsc
    
    7zip
    227,905,645 enwik9.7z
    212,801,291 enwik9.c.7z
    200,138,925 enwik9.xwrt.7z
    
    zip -9
    322,592,222 enwik9.zip
    298,221,996 enwik9.c.zip
    264,330,111 enwik9.xwrt.zip

  3. #3
    Piotr Tarsa (Member)
    Is there some bijective word transform?

  4. #4
    Matt Mahoney (Expert)
    It is not bijective since there are different ways to encode the same string. Of course the coding should be done deterministically.

    Another difference is that xwrt output still looks like text. Spaces are still spaces, for example. Some compressors are tuned for this. Others compress better just because the xwrt output is smaller.

  5. #5
    Shelwien (Administrator)
    @Piotr Tarsa:
    In theory it's possible to make it bijective, by assigning some different meanings to word occurrences which should have been encoded with a dictionary index, but that would make the preprocessing 10x slower without any real compression improvement.

    Afaik, the only real workaround would be a word-wise CM.

  6. #6
    Cyan (Member)
    Some wbpe results using Zhuff v0.7. As a weak LZ compressor, it should benefit the most from wbpe preprocessing:

    zhuff
    365,122,888 enwik9.zhf
    318,266,196 enwik9.c.zhf
    318,559,796 enwik9.e.zhf
    which it does....


    Same test, but using ZhuffMax and inikep's xwrt:

    wbpe c (v1.00)
    673,370,615 enwik9.c

    xwrt -2 -b255 -m255 -s -f64 (v3.2)
    561,080,844 enwik9.xwrt2

    xwrt -0 -b255 -m255 -s -f64 (v3.2)
    457,847,842 enwik9.xwrt0

    zhuffmax
    309,866,686 enwik9.zhf
    290,880,885 enwik9.c.zhf
    262,173,096 enwik9.xwrt2.zhf
    232,449,742 enwik9.xwrt0.zhf
    Quite impressive improvement...
    Last edited by Cyan; 14th June 2011 at 02:14.

  7. #7
    Piotr Tarsa (Member)
    Considering a bijective word transform, I have an idea.

    Progressing through the text stream, for each word we add it to the dictionary and assign to it a randomly generated word. From that point on, we replace any occurrence of that word with the generated word and vice versa.

    What do you think?

    PS:
    We can store entire phrases (say, up to N words in size) in the dictionary, not just words.
    Last edited by Piotr Tarsa; 14th June 2011 at 02:23.

  8. #8
    Shelwien (Administrator)
    @Piotr Tarsa:
    It would only work if we built a dictionary with all words first, then transformed the text without using any literals. But if you have to use both literals and dictionary indexes, masking is complicated, so in practice it won't be bijective.

  9. #9
    Piotr Tarsa (Member)
    No, it won't require having a dictionary with all words first.

    Some examples:

    input : mappings (word-word-startingpoint) : output
    a b a : a-b-1 : b a b
    a b c a : a-c-1 d-b-2 : c d a c
    a a a a : a-b-1 : b b b b
    a b b a : a-b-1 : b a a b
    etc


    The algorithm in steps:
    1. Get a word
    2. Check if it is in dictionary
    3. If it's not in dictionary then generate a companion word that is not in dictionary
    4. Replace word with its companion
    5. Go to 1

    Dictionary should be bidirectional, ie dictionary[a] == b if, and only if dictionary[b] == a.

    Edit:
    Wait, this is wrong. We should not replace the first occurrence, or else we won't be able to guess the original word. So the algorithm should look like:
    1. Get a word
    2. Check if it is in dictionary
    3. If it's not in dictionary then generate a companion word that is not in dictionary
    4. Else replace word with its companion
    5. Go to 1

    Corrected examples:
    input : mappings (word-word-startingpoint) : output
    a b a : a-b-2 : a a b
    a b c a : a-c-3 d-b-2 : a b a c
    a a a a : a-b-1 : a b b b
    a b b a : a-b-1 : a b a b
    etc
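
    To make the corrected algorithm concrete, here is a toy sketch where the "words" are single letters, matching the examples above. The hard-coded input and the "next unused letter" companion rule are assumptions for illustration; any deterministic rule shared by encoder and decoder works, and since the mapping is bidirectional the same routine also decodes its own output.

    Code:
    // Toy sketch of the bidirectional word-swap transform, with single
    // letters standing in for words.  Running it again on its own output
    // recovers the original, so the transform is its own inverse.
    #include <stdio.h>

    int main(void) {
      const char* input = "a b c a";     // hypothetical example input
      int dict[256], used[256] = {0};
      for (int i = 0; i < 256; ++i) dict[i] = -1;  // dict[x]==y iff dict[y]==x
      int next = 'a';                    // deterministic companion generator
      for (const char* p = input; *p; ++p) {
        unsigned char w = (unsigned char)*p;
        if (w == ' ') { putchar(' '); continue; }  // separators pass through
        if (dict[w] < 0) {               // first occurrence: assign a companion
          used[w] = 1;
          while (used[next]) ++next;     // any word not yet in the dictionary
          used[next] = 1;
          dict[w] = next; dict[next] = w;
          putchar(w);                    // first occurrence is left unchanged
        } else {
          putchar(dict[w]);              // later occurrences are swapped
        }
      }
      putchar('\n');                     // prints "a a c b" with this rule
      return 0;
    }
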
    Last edited by Piotr Tarsa; 14th June 2011 at 03:22.

  10. #10
    Shelwien (Administrator)
    1. If you work with words (i.e. you are able to split the input into relatively short strings, such that a suffix of one string can't be a prefix of another, because of spaces), things certainly get simpler; it's basically like working with a large alphabet. Surely a simple permutation can be bijective. But it's not always possible. For example, if there were some file, like a URL list, without many spaces, the non-bijective WRT would be able to improve its compression, while a word-wise method wouldn't. There are also whole languages without spaces, like Japanese.

    2. Surely you can map words onto other words (presumably shorter ones). But what kind of function would map all real words to unique short word-like strings? Surely it's possible, but it feels the same as what I said before - build a dictionary with all words and replace them with indexes. Just a static version.

    3. I don't understand your "word-word-startingpoint" example.

  11. #11
    Piotr Tarsa (Member)
    But its not always possible. For example, if there would be some file, like a url list, which won't have much
    spaces, the non-bijective wrt would be able to improve its compression, while a word-wise method won't.
    So we can then split also on dots, not only on spaces. Currently I don't have an idea how to generalize it efficiently. Generally (i.e. in most cases) xwrt should be more efficient on XML or textual data, but it's not obvious, as we can use some heuristics for when and how to generate companions; also, in my method we can reset the dictionary periodically so we can adapt to the data.

    But what kind of function would map all real words to unique short word-like strings?
    As I wrote before, the dictionary should be built while progressing through the text. The companion word could be any word that is not yet in the dictionary.

    I don't understand your "word-word-startingpoint" example
    There's a mapping from one word to a second word (and vice versa), and the words are swapped starting from the startingpoint position in the stream.

  12. #12
    Shelwien (Administrator)
    > Currently I don't have an idea how to generalize it efficiently.

    I think it's the same problem as LZ optimal parsing. It's probably even possible to build a usable dictionary based on lzma's parsing, although of course it would be better with precise coding costs based on the actual encoding.

    > dictionary should be built while progressing through text

    Sure, that would work, but there's a fairly bad "worst case" where after some point we only encounter previously assigned "codewords" and have to expand them. Also, as I said, in practice we need a way to explicitly specify the length for a new word - do you have a bijective algo for that too?

  13. #13
    biject.bwts (Member)
    Quote Originally Posted by Matt Mahoney View Post
    It is not bijective since there are different ways to encode the same string. Of course the coding should be done deterministically.

    Another difference is that xwrt output still looks like text. Spaces are still spaces, for example. Some compressors are tuned for this. Others compress better just because the xwrt output is smaller.
    It's not bijective for several reasons. The dictionary table would have to be different, for one thing. And you would have to handle getting the last byte of the file on decode in a special way, since the file could be of odd or even byte length.

    You could take care of both of these problems similarly to the way one does bijective adaptive Huffman compression, using 16-bit symbols as tokens regardless of whether the file is of odd or even byte length.

    You would have an empty dictionary at the start; there would not be a symbol for ESC or CAP, as these can be created when the first word is added to the table. Eventually you end up with 254 word strings. You could actually keep track of thousands of words but allow the paired codes to be only the most common 254. By the time you get to the end of the file, the 254 words of interest should be the same as Matt's, but their positions in the table could be different; that is of no concern since you use a full byte for a word's position in the table. Not only would this be bijective, it would work on any file. Of course, files not well suited would end up longer.

  14. #14
    biject.bwts (Member)
    I posted too quickly; there are actually more steps involved. I should have coded it first; there are lots of problems I glossed over. That's what happens when I drink beer and post right after I read it. But I still think I could code this bijectively, with some differences, maybe closer to Huffman and bijective LZW, but it can be done. I got good German beer tonight. Again, SORRY, too fast on the draw.

  15. #15
    inikep (Programmer)
    Quote Originally Posted by Matt Mahoney View Post
    Another difference is that xwrt output still looks like text. Spaces are still spaces, for example. Some compressors are tuned for this. Others compress better just because the xwrt output is smaller.
    I tried your approach several years ago (using all 0-255 byte values as the output alphabet), but AFAIR it gave worse results even for LZ77. Surprisingly, there are more techniques that make the output smaller but give worse compression after LZ77.
    Last edited by inikep; 14th June 2011 at 10:00.

  16. #16
    Piotr Tarsa (Member)
    Sure, that would work, but there's a fairly bad "worst case"
    where after some point we only encounter previously assigned "codewords" and have to expand them.
    In that case we can do a dictionary reset without even outputting any flags; just apply the same heuristic in the encoder and decoder. Of course, frequent dictionary resets and assigning different codes will hurt compression with statistical/contextual algorithms.

    Also as I said, in practice we need a way to explicitly specify the length for a new word -
    do you have a bijective algo for that too?
    All words (both original and generated) are separated by spaces (and/or dots, whatever), so the length can easily be determined.

  17. #17
    biject.bwts (Member)
    You could have two sets of indexes: one set is the strings made from the characters a-z, the other set is strings made of characters that are not a-z or A-Z. When you are using the first set, the string is terminated when a non a-z character is encountered, or something like an embedded capital letter. After a string from the first set is done, if you have a CAP, the next string is still from the first set and you know its letter is capitalized. If you don't have the ESC symbol, the following would be a string from the non a-z or A-Z set.
    When you are using a code from the second set, when it ends, a CAP means that the first set is next and that its first character is capitalized. If an index appears, you still go to the first set and the a-z string follows. If ESC follows, then you are in the process of building a string of either the first set type or the second set type. To keep it bijective you have to prevent two indexes from actually using the same string. Yet to be bijective you have to allow the sequence of characters that makes up such a string to occur while you build a different string. There are many ways to do this; here is one fast way: you assume that the string is a repeat.
    Example: notice if you have "a aa aaa" and you assume you already have 0 being CAP, 1 being ESC, and 2 being "a" in set one, and you have 2 being " " for the alternate set. Assume, since the sets alternate except when CAP is used, that the current next string is of type 1. Then you code the above as:
    2,2,ESC,a,2,ESC,a

    It's just an ongoing thought.

  18. #18
    Matt Mahoney (Expert)
    A few comments. First, on building an implicit dictionary dynamically in a single pass, every word in the vocabulary would need to be assigned a code. For a normal Zipf distribution, about half of the dictionary would have words used only once. But each time a word is encoded it increases the output size because you are transmitting information about an arbitrary mapping. In the xwrt experiment, the option -f64 means to encode only words that occur at least 64 times. I tuned the xwrt options for ppmonstr, and increasing the dictionary size makes compression worse.

    Also, I realize that a bijection is theoretically optimal as a compression algorithm but not necessarily as a preprocessor. For a preprocessor you need to preserve semantic boundaries. This is the reason that capitalization modeling works. The symbol denoting capitalization is separated from the dictionary code because it does not change the meaning of the word. For the same reason, common English suffixes like -s, -ed, -ing should be separate symbols because they don't change the meaning of the root word.

    Byte pair encoding does a good job of discovering suffixes like these. I am pretty sure it would also work to discover the vocabularies of languages without spaces like Chinese and Japanese although I couldn't use my hash table optimization. In 1999 I think (I didn't date it) I did some experiments with discovering word boundaries in English text without spaces based solely on n-gram statistics. http://cs.fit.edu/~mmahoney/dissertation/lex1.html

    This is also why combining different types of characters like spaces with letters does not work, even though the preprocessed output is initially smaller.

    Dictionary encoding is not bijective because there are lots of ways of encoding the same strings. For example, if the dictionary has "ab" and "bc", then "abc" could be encoded as "ab","c" or "a","bc" or "a","b","c" or ESC,"a",CAP,"B",CAP,ESC,"c", etc. I think that as long as coding is done in a consistent (predictable) manner then it doesn't hurt compression much. wbpe uses greedy parsing. I haven't tried lookahead parsing but I suspect that making the parsing less predictable might offset some gains.
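
    For reference, greedy parsing here means taking the longest dictionary entry that matches at each position, with no lookahead. A minimal sketch of that rule (not the actual wbpe encoder; the linear dictionary scan, function name, and reserved code values are illustrative assumptions):

    Code:
    // Greedy parse against a small dictionary: emit the longest matching
    // entry, otherwise ESC plus the literal byte.  CAP handling and
    // dictionary construction are omitted.
    #include <stdio.h>
    #include <string.h>

    void greedy_encode(const char* s, const char* dict[], int ndict, FILE* out) {
      const int ESC = 0;                    // codes 0 and 1 reserved (ESC, CAP)
      while (*s) {
        int best = -1;
        size_t bestlen = 0;
        for (int i = 0; i < ndict; ++i) {   // longest matching entry wins
          size_t len = strlen(dict[i]);
          if (len > bestlen && strncmp(s, dict[i], len) == 0) {
            best = i;
            bestlen = len;
          }
        }
        if (best >= 0) { putc(best + 2, out); s += bestlen; }
        else           { putc(ESC, out); putc((unsigned char)*s++, out); }
      }
    }
    With a dictionary containing "ab" and "bc", this rule always parses "abc" as "ab" followed by an escaped "c", which is the kind of consistent, predictable choice described above.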

    I also wonder if xwrt would work better than wbpe using a compression algorithm tuned to the output. It works better for small memory compressors like deflate because the output is smaller, and for high end compressors that have text models. xwrt uses byte values 128-255 as dictionary codes and leaves the rest unchanged, so the output still looks like text. A model for wbpe would need to know which ranges of codes represent spaces, punctuation, digits, and letters, which can vary depending on the input. Both preprocessors are conceptually doing the same thing, replacing whole words with shorter byte sequences, but wbpe also preserves root words, suffixes, and n-gram statistics besides compressing white space and punctuation sequences. wbpe also preserves collating order which can be important for BWT compressors that depend on alphabet order. ASCII is already pretty well optimized for this because characters of the same type are already grouped.

  19. #19
    Piotr Tarsa (Member)
    In my idea, words are replaced starting from the second occurrence, so with that Zipf distribution, in the basic form of my idea, less than half of the words will be replaced.

    We could use other strategies than immediate use of the dictionary. For example, we can divide the input file into segments and transform segment n using the dictionary built on segment n - 1.

    The requirements for bijectivity are:
    - when replacing we should replace all words that are in dictionary,
    - there is only one way to build dictionary,
    - no phrase is a prefix of another (in case we're replacing phrases instead of words),
    - and more?

    EDIT:
    Well, with some other constraints it could still be bijective. But you should be able (ie. I'm pretty sure you're smart enough) to tell if some transform is bijective or not.

    But each time a word is encoded it increases the output size because you are transmitting information about an arbitrary mapping.
    Well, that information is generated using a known algorithm whose input is the previous data, so the only increase in information should be the generation algorithm. In usual WRT implementations there is a need to attach a static dictionary, which actually contains a lot of information.


    The above strategy that uses dictionaries from previous segments has one big disadvantage: it adapts slowly. With very non-stationary data it could even hurt compression if the word statistics are changing fast from one segment to another.

    On the other hand, immediate use of the dictionary (i.e. the basic version of my idea) could probably be successfully mixed with PPMII. Recall that PPMII initializes a new context of order n using statistics from a few lower orders. When combining my WRT with PPMII we can initialize the statistics for a new companion word with the statistics from the original word.

    BTW:
    Matt, do you use information inheritance in your PAQ algo?
    Last edited by Piotr Tarsa; 15th June 2011 at 00:22.

  20. #20
    m^2 (Member)
    Quote Originally Posted by Matt Mahoney View Post
    Also, I realize that a bijection is theoretically optimal as a compression algorithm but not necessarily as a preprocessor. For a preprocessor you need to preserve semantic boundaries.
    Well, as far as I can see, for optimality it should be paired with a non-injective compressor, because that's the only way you can get bijectivity of the combined transform. Am I missing something?

  21. #21
    Matt Mahoney (Expert)
    LZW also builds a dynamic dictionary and only codes words when they are seen for the second time. LZW compression isn't very good but I suppose that's because current models aren't very good and nobody bothered to work on it during the long period when it was patented. I suppose a variation of LZW could look ahead during compression and write codes to indicate whether a word should be added to the dictionary.
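
    For readers less familiar with LZW, here is a minimal sketch of what "builds a dynamic dictionary and only codes a string when it is seen again" means in practice. It is illustration only: the decimal output, the linear table search, and the fixed table size are assumptions; a real LZW coder packs the codes into bit fields of growing width.

    Code:
    // Minimal LZW encoder sketch: codes 0..255 are the single bytes; each
    // emitted code also adds one new dictionary entry (current string plus
    // the next byte), so a string is coded as a unit only from its second
    // occurrence on.
    #include <stdio.h>

    #define MAXD 4096
    static int pre[MAXD];                  // prefix code of each entry
    static int chr[MAXD];                  // appended byte of each entry

    static int find(int n, int p, int c) { // linear search, fine for a sketch
      for (int i = 256; i < n; ++i)
        if (pre[i] == p && chr[i] == c) return i;
      return -1;
    }

    void lzw_encode(FILE* in, FILE* out) {
      int n = 256, p = getc(in), c;
      if (p == EOF) return;
      while ((c = getc(in)) != EOF) {
        int q = find(n, p, c);
        if (q >= 0) p = q;                 // current string extends a known one
        else {
          fprintf(out, "%d ", p);          // emit code for the known string
          if (n < MAXD) { pre[n] = p; chr[n] = c; ++n; }  // learn string+c
          p = c;
        }
      }
      fprintf(out, "%d\n", p);             // flush the last pending string
    }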

    PAQ does not use information inheritance. High and low order statistics are learned at the same time and mixed.

    Yes, a non-injective compressor could be combined with a non-surjective preprocessor to make a bijection.

  22. #22
    Piotr Tarsa (Member)
    LZW isn't bijective at all, in the first place.

  23. #23
    Matt Mahoney (Expert)
    I think LZW could be made bijective if the dictionary codes were Huffman coded (with some tricky EOF handling) instead of rounding up the number of bits and sub-optimal parses were excluded.

  24. #24
    Piotr Tarsa (Member)
    I once had an idea about bijective LZW, but that was such a long time ago that I've forgotten it. Anyway, computing such a transform would be rather slow and unsuitable for integration with other algorithms. Also, you've mentioned Huffman codes, and with such codes it wouldn't be good as a preprocessing step for byte-based algorithms.

  25. #25
    Matt Mahoney (Expert)
    Yes, but another modification of LZW would be to output byte-aligned codes. That is, while the dictionary is smaller than 64K, all codes are either 1 or 2 bytes. This seems similar to the algorithm you are describing. It would not be bijective because not all of the code space could be used, except when the number of allowed choices from the dictionary is 1 (mod 255). A choice is allowed if there is not a longer match in the dictionary.

  26. #26
    biject.bwts (Member)
    Quote Originally Posted by Piotr Tarsa View Post
    LZW isn't bijective at all, in the first place.
    Actually, I wrote some BIJECTIVE LZW coders and posted them on the net around 2003. I only posted the 2-root versions but had a 4-root version somewhere. I also spent a lot of time on a 256-root one, but it was slow and required too much memory for even a small tree. If you go to the link below you can see what I posted years ago.

    http://bijective.dogma.net/compreslzw1.htm

    The executables no longer run on my current machine; it was all for DOS. But I have compiled much of the old stuff on my current machine using MinGW with no problems.

    They were coded just for the fun of it. Each time you go through the table, when you get to a leaf, the old string gets replaced with itself plus a zero and itself plus a one. When I run out of space I have a complete Huffman tree built by the method, and I use it for the rest of the file.

    Lots of times when I get something working I lose interest in going farther. It's when the end is uncertain that I am driven farther; when the end is within easy reach I get bored.

    The ones I posted were binary, where you go a bit at a time, and yes, it worked for all files, so it was bijective. For the 4-root one you go two bits at a time, and if I remember correctly you only created one new node each time, except when you get stopped on the short node 3 times; then you finally get the 4 leaves and the base one goes away. The problem with 8-bit symbols is that you have to wait for 255 follow-on symbols before that string goes away, and you have the 256 replacement paths. It took a lot of space and bookkeeping to store enough for each string used.

  27. #27
    biject.bwts (Member)
    Quote Originally Posted by Matt Mahoney View Post
    LZW also builds a dynamic dictionary and only codes words when they are seen for the second time. LZW compression isn't very good but I suppose that's because current models aren't very good and nobody bothered to work on it during the long period when it was patented. I suppose a variation of LZW could look ahead during compression and write codes to indicate whether a word should be added to the dictionary.

    PAQ does not use information inheritance. High and low order statistics are learned at the same time and mixed.

    Yes, a non-injective compressor could be combined with a non-surjective preprocessor to make a bijection.
    Actually, if you count the starting single bytes as ROOTS, meaning you think of them as numbers, then LZW only writes out numbers and nothing else. The code book starts with only those numbers defined. As you compress the file, all you do is write a single number for each string. When you write that single number, you store in the table a value for the old string plus the last character read. As you build new strings, there is no writing of the new character to the file. It's really neat. You can start off with the base numbers 0 to 255, one for each root. Then every time you write a single number you increase the span of numbers by 1.
    You use a single byte at the start for the 256 cases. Then you use a Huffman table of 257 values, 255 of which are 8 bits and 2 of which are 9 bits. As this goes on you automatically use more space. Mark uses a fixed field length for assigning codes; I think that is wasteful.
    You can immediately see where even this version of coding is not bijective. Suppose the file builds every two-character string that starts with "A"; then when you compress, you never need to use the value for "A" again, since you have the codes for every character after "A".
    The best basic LZW stuff is at

    http://marknelson.us/1989/10/01/lzw-data-compression/


    I am not good with words. Even though you say "Yes, a non-injective compressor could be combined with a non-surjective preprocessor to make a bijection", and what you say is true, I still think it's best to twist the definitions so that each stage is bijective in a certain sense. For example, looking only at all files: I could map bijectively to a file that only contained 3 types of characters, A B C; the file would be much longer, but it's not really a bijection of all files to all files. If you look at the code I did for the bijective BWT DC thing, the various layers are not bijective from files to files, but the overall use of the programs makes it bijective end to end, files to files.

  28. #28
    Matt Mahoney (Expert)
    I did some experiments to test which compressors are optimized for text. I ran 3 transforms on book1.

    book3 - each byte is multiplied by 3. (Inverse is to multiply by 171).
    booka - add 128 to each byte.
    bookr - replace each byte with 255 - byte.

    I then compressed it with all of the top ranked programs on LTCB as well as many common programs and a wide variety of programs with unusual algorithms.

    I found that nearly every compression algorithm does worse on book3. The only algorithm not affected is LZW (compress, lzrw5). It's not because they have text models, but (I think) because the character range is more spread out and interleaved. It is easy to see how this affects BWT and ST algorithms where alphabet order is important, but not so easy for CM, PPM, LZ77, LZMA, LZP, CTW, DMC, SR, etc. Occasionally it helps for LZ77 but most times it is worse just like the others.

    The other transforms mostly preserve alphabet order but reverse or rotate it. This affects only algorithms with explicit text models, which include nanozip, paq8px, slim, zp -m4 (zpaq max.cfg), xwrt, lpaq1, durilca, paq9a. It also seems that ppmd and ppmonstr are affected, although from the algorithm descriptions you might not think so. (I believe that SEE models some high order bits). The top ranked programs that are not affected are cmm4, zp -m3 (zpaq mid.cfg), ctw, bit, 7zip, zip, and most of the BWT and ST algorithms like m03, bcm, bsc, bbb, dark, zp -m2, szip, bzip2. flashzip unexpectedly improved.

    Code:
    // To transform: a < book1 > book3
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <io.h>     // setmode() is declared here on MinGW/MSVC
    int main() {
      setmode(0, O_BINARY);
      setmode(1, O_BINARY);
      int c;
      while ((c=getchar())!=EOF)
        putchar(c*3);  // or (c+128) or (255-c)
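        // inverse: putchar(c*171) works because 3*171 = 513 = 2*256+1, i.e. 1 (mod 256)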
      return 0;
    }
    
    /*
    
    768,771 book1  original
    768,771 book3  c*3 (inverse = c*171)
    768,771 booka  c+128
    768,771 bookr  255-c
    
    nz -cc (CM)
    189,501 book1.nz
    208,548 book3.nz
    206,517 booka.nz
    206,552 bookr.nz
    
    paq8px_v69 -6 (CM)
    191,784 book1.paq8px
    202,457 book3.paq8px
    200,988 booka.paq8px
    200,990 bookr.paq8px
    
    slim23d (PPM)
    192,181 book1.slim
    205,547 book3.slim
    204,745 booka.slim
    204,769 bookr.slim
    
    nz (BWT)
    198,960 book1.nz
    219,602 book3.nz
    218,342 booka.nz
    218,274 bookr.nz
    
    zp -m4 (CM, zpaq cmax)
    199,478 book1.zpaq
    207,150 book3.zpaq
    204,822 booka.zpaq
    204,819 bookr.zpaq
    
    xwrt -l12 (CM, dict+lpaq6)
    202,057 book1.xwrt
    210,767 book3.xwrt
    218,565 booka.xwrt
    218,865 bookr.xwrt
    
    lpaq1 -6 (CM)
    202,662 book1.lpaq1
    208,007 book3.lpaq1
    206,549 booka.lpaq1
    206,354 bookr.lpaq1
    
    durilca v0.5 (PPM)
    203,208 book1.dur
    206,152 book3.dur
    205,853 booka.dur
    205,848 bookr.dur
    
    ppmonstr (PPM -o12)
    203,247 book1.pmm
    206,160 book3.pmm
    205,806 booka.pmm
    205,802 bookr.pmm
    
    paq9a (LZP+CM)
    203,322 book1.paq9a
    210,579 book3.paq9a
    208,123 booka.paq9a
    208,222 bookr.paq9a
    
    cmm4 76 (CM)
    208,426 book1.cmm4
    210,925 book3.cmm4
    208,396 booka.cmm4
    208,406 bookr.cmm4
    
    zp -m3 (CM, zpaq cmid)
    208,704 book1.zpaq
    210,682 book3.zpaq
    208,687 booka.zpaq
    208,723 bookr.zpaq
    
    ctw -d6 -n16M -f16M (CTW)
    209,514 book1.ctw
    211,321 book3.ctw
    209,514 booka.ctw
    209,513 bookr.ctw
    
    m03 e 1000000 (BWT)
    210,483 book1.m03
    210,941 book3.m03
    210,483 booka.m03
    210,362 bookr.m03
    
    ppmd -o16 -m256 -r1 (PPM)
    210,495 book1.pmd
    212,808 book3.pmd
    213,003 booka.pmd
    213,007 bookr.pmd
    
    bcm (BWT)
    210,794 book1.bcm
    213,603 book3.bcm
    210,954 booka.bcm
    211,045 bookr.bcm
    
    bsc -p (BWT, no preprocessing)
    212,534 book1.bsc
    213,803 book3.bsc
    212,534 booka.bsc
    212,510 bookr.bsc
    
    bsc -cf (BWT, following contexts)
    212,534 book1.bsc
    213,803 book3.bsc
    212,534 booka.bsc
    212,510 bookr.bsc
    
    bbb (BWT)
    213,162 book1.bbb
    215,162 book3.bbb
    213,145 booka.bbb
    213,123 bookr.bbb
    
    bsc -cp (BWT, preceding contexts)
    213,324 book1.bsc
    215,120 book3.bsc
    213,324 booka.bsc
    213,277 bookr.bsc
    
    bit -p=5 (CM)
    213,604 book1.bit
    216,842 book3.bit
    213,595 booka.bit
    213,633 bookr.bit
    
    dark (BWT)
    214,423 book1.dark
    215,918 book3.dark
    214,423 booka.dark
    214,530 bookr.dark
    
    zp -m2 (BWT, zpaq cbwt2)
    215,290 book1.zpaq
    217,127 book3.zpaq
    215,312 booka.zpaq
    215,145 bookr.zpaq
    
    ppmd (PPM -o4)
    216,021 book1.pmd
    216,410 book3.pmd
    216,304 booka.pmd
    216,304 bookr.pmd
    
    ccmx 7 (CM)
    221,486 book1.ccmx
    224,912 book3.ccmx
    222,053 booka.ccmx
    222,140 bookr.ccmx
    
    szip (ST6)
    226,910 book1.szip
    228,371 book3.szip
    226,910 booka.szip
    226,843 bookr.szip
    
    hook c 256 (DMC)
    230,973 book1.hook
    232,951 book3.hook
    231,119 booka.hook
    230,919 bookr.hook
    
    zp -m1 (BWT, zpaq cbwtrle1)
    231,912 book1.zpaq
    233,982 book3.zpaq
    231,721 booka.zpaq
    231,629 bookr.zpaq
    
    bzip2 (BWT)
    232,598 book1.bz2
    233,666 book3.bz2
    232,607 booka.bz2
    232,568 bookr.bz2
    
    bsc -cf -m0 (ST3)
    241,551 book1.bsc
    242,424 book3.bsc
    241,551 booka.bsc
    241,539 bookr.bsc
    
    flashzip a -m3 -c7 -b7 (LZ77)
    258,194 book1.fz
    258,361 book3.fz
    254,819 booka.fz
    254,923 bookr.fz
    
    7zip (LZMA)
    261,064 book1.7z
    263,472 book3.7z
    261,064 booka.7z
    261,174 bookr.7z
    
    sr2 (SR)
    276,364 book1.sr2
    278,293 book3.sr2
    276,389 booka.sr2
    276,369 bookr.sr2
    
    kzip
    299,545 book1.zip
    299,575 book3.zip
    299,521 booka.zip
    299,953 bookr.zip
    
    rar v2.50 (LZ77?)
    305,906 BOOK1.RAR
    306,176 BOOK3.RAR
    305,907 BOOKA.RAR
    305,905 BOOKR.RAR
    
    thor e5 (LZ77)
    306,144 book1.thor
    306,144 book3.thor
    306,144 booka.thor
    306,148 bookr.thor
    
    slug (ROLZ)
    309,919 book1.slug
    308,363 book3.slug
    310,936 booka.slug
    310,592 bookr.slug
    
    tornado (LZ77)
    311,145 book1.tor
    311,353 book3.tor
    311,129 booka.tor
    311,145 bookr.tor
    
    gzip -9 (LZ77)
    312,281 book1.gz
    312,465 book3.gz
    312,282 booka.gz
    312,281 bookr.gz
    
    zip (LZ77)
    313,602 book1.zip
    313,762 book3.zip
    313,603 booka.zip
    313,602 bookr.zip
    
    compress (LZW)
    335,033 BOOK1.Z
    335,033 BOOK3.Z
    335,033 BOOKA.Z
    335,033 BOOKR.Z
    
    srank (SR)
    390,144 book1.sr
    390,638 book3.sr
    390,608 booka.sr
    390,438 bookr.sr
    
    lzrw3-a (LZ77)
    416,135 book1.lzrw3-a
    413,844 book3.lzrw3-a
    416,560 booka.lzrw3-a
    414,290 bookr.lzrw3-a
    
    bpe 5000 4096 200 3 (BPE)
    432,845 book1.bpe
    439,562 book3.bpe
    432,861 booka.bpe
    433,383 bookr.bpe
    
    fpaq0 (o0)
    435,227 book1.fpaq0
    435,363 book3.fpaq0
    435,272 booka.fpaq0
    435,297 bookr.fpaq0
    
    lzrw5 (LZW)
    444,560 book1.lzrw5
    444,560 book3.lzrw5
    444,560 booka.lzrw5
    444,560 bookr.lzrw5
    
    lzop (LZ77)
    477,431 book1.lzo
    476,937 book3.lzo
    477,382 booka.lzo
    476,962 bookr.lzo
    
    lzrw1 (LZ77)
    522,274 book1.lzrw1
    520,777 book3.lzrw1
    523,183 booka.lzrw1
    524,330 bookr.lzrw1
    
    thor e1 (LZP)
    535,114 book1.thor
    532,958 book3.thor
    556,350 booka.thor
    556,350 bookr.thor
    
    ppp (SR)
    552,507 book1.ppp
    545,075 book3.ppp
    552,507 booka.ppp
    552,507 bookr.ppp
    
    flzp (LZP)
    566,253 book1.flzp
    566,269 book3.flzp
    566,253 booka.flzp
    566,253 bookr.flzp
    
    */

  29. #29
    Piotr Tarsa (Member)
    Matt:
    IIRC Sami Runsas did a similar experiment.

  30. #30
    Matt Mahoney (Expert)
    Yes, but only for BWT based compressors. http://compressionratings.com/bwt.html
