
Thread: Experiments with small dictionary coding

  1. #1
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts

    Experiments with small dictionary coding

    I have been doing some experiments using small dictionaries to improve text compression. The idea is to replace common strings with one-byte codes to decrease the size (and memory usage) of the file to be compressed. This helps with many poor to average compressors but hurts compression with the best ones, because they already have models for text and no longer recognize the preprocessed input. Here are some results for book1 (768 KB) and enwik8 (100 MB). Notice that preprocessing helps more on bigger files. The three columns are the compressed size without preprocessing, the compressed size with preprocessing, and the ratio of the second to the first. A ratio less than 1 means that preprocessing helped.
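
    As a small worked example of how the ratio column is derived, the following sketch recomputes it for two rows of the book1 table (the gz row and the uncompressed book1 row). The function name `ratio` is only illustrative and is not part of the attached code.

```python
def ratio(without: int, with_pre: int) -> float:
    """Compressed size with preprocessing divided by size without it."""
    return round(with_pre / without, 4)

# book1.gz: 312281 bytes without preprocessing, 295220 bytes with it.
print(ratio(312281, 295220))  # 0.9454 (less than 1, so preprocessing helped)

# Uncompressed book1: 768771 bytes raw, 501598 bytes after preprocessing.
print(ratio(768771, 501598))  # 0.6525
```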

    Code:
    book1* 16,206,860 (total compressed size without preprocessing)
    b1bse* 14,777,645 (total compressed size with preprocessing)
    
    book1.nz_cc          189501    206170  1.0880
    book1.paq8px         191784    199639  1.0410
    book1.paq8l          192368    199944  1.0394
    book1.nz             198960    218981  1.1006
    book1.zpaq_c3        199465    206474  1.0351
    book1.paq6           200863    205264  1.0219
    book1.lpaq1          202539    205958  1.0169
    book1.pmm            203247    203893  1.0032
    book1.paq9a          203322    207463  1.0204
    book1.lpaq9m         206201    212308  1.0296
    book1.cmm4           208423    208613  1.0009
    book1.zpaq_c2        208691    210052  1.0065
    book1.paq1           209195    220327  1.0532
    book1.ctw            209514    212013  1.0119
    book1.bbb            213162    216746  1.0168
    book1.grzip          215353    215404  1.0002
    book1.pmd            216021    208872  0.9669
    book1.pms            216233    219423  1.0148
    book1.p6             219434    223488  1.0185
    book1.bwt.fpaq0f2    220338    225362  1.0228
    book1.szip           225639    224602  0.9954
    book1.ACB            225653    232937  1.0323
    book1.zpaq_c1        226461    226543  1.0004
    book1.bz2            232598    228332  0.9817
    book1.lzpxj          237207    224591  0.9468
    book1.dmc            237997    233018  0.9791
    book1.bwt.fpaq0p     244106    249549  1.0223
    book1.tarsalzp       246947    240986  0.9759
    book1.7z             261064    252230  0.9662
    book1.cab            264144    255722  0.9681
    book1.sr2            276364    266746  0.9652
    book1.RAR            301298    283515  0.9410
    book1.gz             312281    295220  0.9454
    book1.HA             312391    290902  0.9312
    book1.zip            312502    295441  0.9454
    book1.Z              335033    320149  0.9556
    book1.fcm1           354460    290557  0.8197
    book1.lzo            360592    319705  0.8866
    book1.sr             370592    367564  0.9918
    book1.lzrw3-a        416135    370959  0.8914
    book1.fpaq0f2        424522    386319  0.9100
    book1.bpe            432845    427119  0.9868
    book1.fpaq0          435227    384986  0.8846
    book1.fpaq0p         442501    388102  0.8771
    book1.lzrw5          444560    401756  0.9037
    book1.fastlz         480393    406263  0.8457
    book1.lzrw2          490239    403994  0.8241
    book1.lzrw1-a        522189    417274  0.7991
    book1.ppp            552507    422133  0.7640
    book1.flzp           566253    440837  0.7785
    book1                768771    501598  0.6525
    book1.bwt            768775    501602  0.6525
    
    enwik8* 1,851,111,733
    en8bse* 1,669,465,145
    
    enwik8.paq8px       18293940  18368792  1.0041
    enwik8.paq8l        18518485  18443403  0.9959
    enwik8.nz_cc        18826931  19172784  1.0184
    enwik8.lpaq9m       19072743  19116075  1.0023
    enwik8.zpaq_c3      19448650  19452856  1.0002
    enwik8.pmm          19701161  19081217  0.9685
    enwik8.lpaq1        19796957  19388356  0.9794
    enwik8.paq9a        20129573  19966982  0.9919
    enwik8.paq6         20303336  20366657  1.0031
    enwik8.cmm4         20548514  19648208  0.9562
    enwik8.zpaq_c2      20941558  19966752  0.9535
    enwik8.nz           20948832  21077069  1.0061
    enwik8.bwt.fpaq0f2  21798843  21958963  1.0073
    enwik8.paq1         22156982  22597001  1.0199
    enwik8.bwt.fpaq0p   23809591  23781162  0.9988
    enwik8.grzip        23846878  23387089  0.9807
    enwik8.bbb          24576921  24195409  0.9845
    enwik8.zpaq_c1      24837469  22340732  0.8995
    enwik8.tarsalzp     25134862  23510970  0.9354
    enwik8.lzpxj        25251404  22663472  0.8975
    enwik8.p6           25377998  24415182  0.9621
    enwik8.ctw          25453025  25583219  1.0051
    enwik8.7z           25895909  24748071  0.9557
    enwik8.szip         26120472  25532125  0.9775
    enwik8.pmd          26275353  24568527  0.9350
    enwik8.pms          26310248  26023665  0.9891
    enwik8.dmc          28402672  27693440  0.9750
    enwik8.cab          28465607  27130867  0.9531
    enwik8.bz2          29008758  27830200  0.9594
    enwik8.sr2          30432506  28407883  0.9335
    enwik8.RAR          35107917  33215137  0.9461
    enwik8.HA           36379137  34227814  0.9409
    enwik8.gz           36445248  34590172  0.9491
    enwik8.zip          36445470  34590394  0.9491
    enwik8.lzo          41217688  37332065  0.9057
    enwik8.sr           43091439  40884791  0.9488
    enwik8.fcm1         45402225  36521399  0.8044
    enwik8.Z            45763941  43314761  0.9465
    enwik8.lzrw3-a      48009194  43018119  0.8960
    enwik8.bpe          53906667  53214457  0.9872
    enwik8.fastlz       54658924  46853401  0.8572
    enwik8.lzrw2        55360907  46927138  0.8477
    enwik8.fpaq0f2      56916872  53589513  0.9415
    enwik8.flzp         57366279  47492203  0.8279
    enwik8.lzrw5        59375192  53659758  0.9037
    enwik8.lzrw1-a      59471657  48941846  0.8229
    enwik8.fpaq0p       61457810  55334754  0.9004
    enwik8.ppp          61657971  49917765  0.8096
    enwik8.fpaq0        63391013  57414840  0.9057
    enwik8             100000000  69003843  0.6900
    enwik8.bwt         100000004  69003847  0.6900
    Decoding is fast and simple. For example, enwik8 decodes in 1.2 seconds. The decoding rules are:

    0 c n s1 s2 ... sn = assign n-byte string s1...sn to code c, where c = 3...255
    1 c = output c (escape code)
    2 = change the case of the next output byte (toggle bit 5)
    c in 3...255 = output the string assigned to code c. Initially each code decodes to itself.
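
    The four rules above can be sketched in Python as follows. This is not the attached bse code, only a minimal illustration. It assumes the input is well formed (no truncated definitions or escapes) and that code 2 toggles bit 5 of the first byte of whatever is output next. The names `bse_decode`, `table`, and `emit` are made up for this example.

```python
def bse_decode(data: bytes) -> bytes:
    """Decode a byte stream using the four rules listed above."""
    # Initially each code 3..255 decodes to itself.
    table = {c: bytes([c]) for c in range(3, 256)}
    out = bytearray()
    toggle = False  # set by code 2, applied to the next output byte
    i = 0

    def emit(s: bytes) -> None:
        nonlocal toggle
        if toggle and s:
            out.append(s[0] ^ 0x20)  # toggle bit 5 (change case)
            out.extend(s[1:])
            toggle = False
        else:
            out.extend(s)

    while i < len(data):
        c = data[i]
        if c == 0:
            # 0 c n s1 ... sn: assign an n-byte string to code c.
            code, n = data[i + 1], data[i + 2]
            table[code] = data[i + 3:i + 3 + n]
            i += 3 + n
        elif c == 1:
            # 1 c: escape, output the byte c literally.
            emit(data[i + 1:i + 2])
            i += 2
        elif c == 2:
            toggle = True
            i += 1
        else:
            emit(table[c])
            i += 1
    return bytes(out)


# Define code 3 as "the", then encode "The", a space, an escaped byte 2, "the".
stream = bytes([0, 3, 3]) + b"the" + bytes([2, 3, ord(" "), 1, 2, 3])
print(bse_decode(stream))  # b'The \x02the'
```

    A table lookup per input byte is all the decoder needs, which fits the observation that decoding is fast.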

    Initially I wrote the encoder to run in 1 pass. It counted the number of times each code or each pair of codes was used. It used greedy parsing: it would look for the longest match in the dictionary and output the corresponding code. If there was no match, it would output an escape code and 1 byte. It would also try changing the case of the first byte and look for a match. If found, it would emit a change-case code (2) followed by a dictionary code. However, given a choice between a direct code and a change-case code for a match of the same length, it would use the direct code.
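
    The greedy parsing step can be sketched as below for a fixed dictionary. This is an illustration under stated assumptions, not the attached encoder: `codes` is a hypothetical mapping from dictionary strings to their one-byte codes (3...255), the case toggle is only tried when the first byte is an ASCII letter, the change-case form is used only when it gives a strictly longer match, and the dictionary definitions (the 0 c n ... records) are not written. The match search is brute force.

```python
ESC, CASE = 1, 2  # escape code and change-case code from the decoding rules


def longest_match(data: bytes, pos: int, codes: dict, max_len: int):
    """Return (length, code) of the longest dictionary string at data[pos:]."""
    for n in range(min(max_len, len(data) - pos), 0, -1):
        code = codes.get(bytes(data[pos:pos + n]))
        if code is not None:
            return n, code
    return 0, None


def greedy_encode(data: bytes, codes: dict) -> bytes:
    max_len = max((len(s) for s in codes), default=0)
    out = bytearray()
    pos = 0
    while pos < len(data):
        first = data[pos]
        n_direct, c_direct = longest_match(data, pos, codes, max_len)
        n_case, c_case = 0, None
        if 65 <= (first & 0xDF) <= 90:  # assumption: only ASCII letters
            flipped = bytes([first ^ 0x20]) + data[pos + 1:pos + max_len]
            n_case, c_case = longest_match(flipped, 0, codes, max_len)
        if n_case > n_direct:
            out += bytes([CASE, c_case])  # change case, then dictionary code
            pos += n_case
        elif n_direct > 0:
            out.append(c_direct)          # direct code wins ties
            pos += n_direct
        else:
            out += bytes([ESC, first])    # no match: escape plus literal byte
            pos += 1
    return bytes(out)


codes = {b"the": 3, b" ": 4, b"t": 5}  # toy dictionary for illustration
print(list(greedy_encode(b"The the!", codes)))  # [2, 3, 4, 3, 1, 33]
```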

    New codes are formed by concatenating the strings assigned to the last two codes emitted. The encoder calculates if this would have saved space in the decoded output by counting for each code and each pair of codes how many times it was emitted, multiplied by the length of the encoded strings. If there is a savings in replacing the least used code, then it is replaced and a replacement code is emitted to keep the decoder in sync.

    I discovered that the following variations improve compression.
    1. Restrict dictionary codes to strings of characters of the same category. Categories are letters, digits, and white space; every other byte value forms a separate category of its own. For example, Abc, 123, ---, and ### are all valid dictionary strings, but #-# is not.
    2. Restrict the maximum string length to 28. Most codes are smaller anyway. Up to 255 would be possible. This helps when there is text mixed with binary data because it prevents the dictionary from being filled up with different codes for strings of zero bytes of different lengths.
    3. Use 2 passes. In the first pass, the dictionary is built. In the second pass, the final dictionary is output at the beginning and not changed.
    4. Sort the dictionary. This helps most for BWT compressors.
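
    Variations 1 and 2 amount to a validity test on candidate dictionary strings, which can be sketched as follows. The exact set of bytes treated as white space, and the restriction of letters and digits to ASCII, are assumptions made for this example; the names `category`, `is_valid_entry`, and `MAX_LEN` are illustrative only.

```python
MAX_LEN = 28  # maximum dictionary string length (variation 2)


def category(b: int):
    """Classify one byte: letter, digit, white space, or its own category."""
    if 65 <= (b & 0xDF) <= 90:   # ASCII letters, either case
        return "letter"
    if 48 <= b <= 57:            # ASCII digits
        return "digit"
    if b in b" \t\n\r\f\v":      # assumed white space set
        return "space"
    return b                     # any other byte value is a category by itself


def is_valid_entry(s: bytes) -> bool:
    """True if s is non-empty, at most MAX_LEN bytes, and of one category."""
    return 0 < len(s) <= MAX_LEN and len({category(b) for b in s}) == 1


for s in (b"Abc", b"123", b"---", b"###", b"#-#"):
    print(s, is_valid_entry(s))
```

    Under these assumptions only the last string, #-#, is rejected, because # and - fall into different single-byte categories.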

    Below is an example dictionary as constructed for book1. The first column is the assigned code. The second is the count from the dictionary building step (EDIT: before sorting, that is a bug. The number is really meaningless). You will notice that there are no codes for "q", "z", or many of the less frequent uppercase letters. These would be escaped.

    Code:
      3      187 "^J"
      4      281 " "
      5      375 """
      6      421 "'"
      7     1191 "+"
      8     1034 ","
      9      431 "-"
     10    16622 "--"
     11      117 "."
     12      553 ";"
     13      388 "<"
     14      176 ">"
     15      252 "A"
     16      114 "C"
     17      326 "D"
     18       77 "E"
     19      311 "Ga"
     20      128 "Gabriel"
     21     1338 "H"
     22      724 "I"
     23      306 "N"
     24      347 "O"
     25      378 "Oak"
     26      529 "P"
     27      271 "R"
     28      429 "S"
     29     1871 "T"
     30      980 "a"
     31      167 "abou"
     32   125551 "ac"
     33      436 "ack"
     34      850 "ad"
     35      236 "ain"
     36       29 "ak"
     37     1162 "al"
     38      387 "all"
     39     6470 "am"
     40      188 "an"
     41      984 "ance"
     42     1984 "and"
     43      691 "ant"
     44    10296 "ar"
     45     1229 "ard"
     46     7170 "as"
     47      293 "ass"
     48      456 "at"
     49      103 "ate"
     50      932 "ation"
     51      315 "ay"
     52      202 "b"
     53     1454 "be"
     54      979 "been"
     55      230 "being"
     56      195 "bet"
     57      251 "ble"
     58      323 "bou"
     59      762 "bout"
     60      498 "bri"
     61      500 "briel"
     62      498 "by"
     63      920 "c"
     64      825 "cas"
     65      685 "ce"
     66      319 "ch"
     67      509 "co"
     68      269 "con"
     69      443 "ct"
     70      850 "cu"
     71      689 "d"
     72      799 "de"
     73     2871 "di"
     74      220 "e"
     75     1361 "ea"
     76       16 "ect"
     77      462 "ed"
     78      374 "el"
     79      431 "en"
     80      633 "ence"
     81      478 "end"
     82      222 "ent"
     83      421 "er"
     84      792 "ere"
     85      491 "es"
     86     1112 "f"
     87      319 "for"
     88      488 "fro"
     89     1418 "from"
     90      326 "g"
     91      307 "ge"
     92     1204 "gh"
     93      337 "ght"
     94      438 "h"
     95     2070 "ha"
     96      309 "had"
     97     7619 "hand"
     98     4068 "have"
     99     3777 "he"
    100     8471 "her"
    101     8022 "here"
    102     5965 "hi"
    103     3326 "him"
    104     3396 "his"
    105     3450 "i"
    106      468 "id"
    107     2431 "ight"
    108     4107 "il"
    109     3687 "in"
    110     5814 "ind"
    111     4423 "ing"
    112     4393 "int"
    113     1486 "into"
    114     3940 "ion"
    115    11958 "ir"
    116     7880 "is"
    117     5986 "ist"
    118     1277 "it"
    119     3479 "its"
    120      861 "ity"
    121     6107 "iv"
    122      742 "j"
    123      462 "k"
    124      522 "ke"
    125      126 "ked"
    126      363 "king"
    127      309 "l"
    128      614 "ld"
    129     1269 "le"
    130      445 "les"
    131      256 "li"
    132      585 "light"
    133      492 "line"
    134      575 "ll"
    135      270 "lo"
    136     1367 "loo"
    137      548 "ly"
    138      944 "m"
    139      248 "man"
    140      360 "me"
    141      937 "med"
    142     1211 "men"
    143      407 "ment"
    144     1642 "mi"
    145      555 "mo"
    146      392 "more"
    147      332 "mp"
    148      333 "n"
    149     1837 "nd"
    150      439 "ne"
    151     1213 "ng"
    152     2468 "ning"
    153      605 "not"
    154     1246 "now"
    155     1352 "o"
    156     1545 "od"
    157      346 "of"
    158      632 "ol"
    159     2962 "on"
    160      513 "one"
    161      291 "oo"
    162      756 "op"
    163      876 "open"
    164      764 "or"
    165      595 "ot"
    166     1587 "other"
    167      430 "ou"
    168      512 "ould"
    169      226 "oun"
    170      220 "our"
    171      659 "out"
    172      278 "over"
    173      282 "ow"
    174     1658 "own"
    175      416 "p"
    176      614 "par"
    177     1038 "pe"
    178     1092 "per"
    179      641 "po"
    180     1605 "pre"
    181      204 "pro"
    182       22 "r"
    183      326 "ra"
    184     1341 "ran"
    185     1744 "re"
    186      291 "rea"
    187      435 "rem"
    188     1899 "res"
    189     1263 "ri"
    190      536 "rn"
    191     2212 "ro"
    192     1775 "round"
    193      552 "rs"
    194      340 "ru"
    195      369 "s"
    196     1246 "sa"
    197     2045 "said"
    198     1163 "se"
    199      294 "see"
    200      270 "she"
    201      716 "si"
    202      261 "sid"
    203     1429 "side"
    204     1207 "sing"
    205      204 "small"
    206      261 "so"
    207      908 "some"
    208     3237 "st"
    209      173 "t"
    210      642 "ta"
    211     2065 "ter"
    212      318 "th"
    213     2548 "than"
    214      742 "that"
    215      897 "the"
    216     1158 "them"
    217     4441 "ther"
    218      440 "there"
    219      442 "this"
    220      846 "thou"
    221      581 "ti"
    222     1430 "time"
    223     1891 "ting"
    224     1309 "tion"
    225     1147 "to"
    226     1159 "tu"
    227     3020 "ty"
    228      592 "u"
    229     2118 "uch"
    230      874 "un"
    231      137 "und"
    232      963 "upon"
    233     1989 "ure"
    234      349 "ut"
    235     2764 "v"
    236     4545 "ve"
    237      732 "ver"
    238      515 "w"
    239      231 "wa"
    240      184 "was"
    241     3961 "way"
    242     2030 "we"
    243      169 "were"
    244      764 "wh"
    245     2658 "what"
    246     1160 "whe"
    247     8879 "when"
    248     1000 "whi"
    249     4155 "which"
    250     2044 "wind"
    251     2127 "with"
    252     1318 "wor"
    253     1487 "x"
    254      657 "y"
    255     1466 "you"
    Code is attached (GPL). The parsing is stupidly inefficient because it is still experimental. Encoding enwik8 takes about 200 seconds. Expect future improvements.
    Last edited by Matt Mahoney; 19th June 2010 at 20:00.

  2. #2
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,502
    Thanks
    741
    Thanked 664 Times in 358 Posts
    dict -1 does more or less the same, but using an escape code looks like an interesting idea for BWT.
    Last edited by Bulat Ziganshin; 19th June 2010 at 19:58.

  3. #3
    Member Fu Siyuan's Avatar
    Join Date
    Apr 2009
    Location
    Mountain View, CA, US
    Posts
    176
    Thanks
    10
    Thanked 17 Times in 2 Posts
    Hi

    Since you have started this topic, I'd like to share the English text preprocessor I recently wrote (already used in the newest csc).

    It uses a small static dictionary, so it is simpler than yours and a little worse in some cases, but much faster.

    enwik8 encodes in 3 s and decodes in 1.8 s on a 2.1 GHz Athlon 4000+.

    You can make some quick tests if you are interested.

  4. #4
    Matt Mahoney
    Yes, your program is much faster. It also works better than mine on the better compressors. Here is a comparison on book1. The first column is preprocessed with my bse and the second column with your TXTFilter. It doesn't hurt compression even though your program pads the output with spaces to make it the same size as the input. The last column is the ratio. Less than 1 means that TXTFilter helps more (or hurts less) than bse.

    Code:
    b1bse.paq8px         199639    192477  0.9641
    b1bse.paq8l          199944    193234  0.9664
    b1bse.pmm            203893    202338  0.9924
    b1bse.paq6           205264    201895  0.9836
    b1bse.lpaq1          205958    204112  0.9910
    b1bse.nz_cc          206170    203205  0.9856
    b1bse.zpaq_c3        206474    204487  0.9904
    b1bse.paq9a          207463    205834  0.9921
    b1bse.cmm4           208613    206909  0.9918
    b1bse.pmd            208872    207685  0.9943
    b1bse.zpaq_c2        210052    207865  0.9896
    b1bse.ctw            212013    210213  0.9915
    b1bse.lpaq9m         212308    208317  0.9812
    b1bse.grzip          215404    214693  0.9967
    b1bse.bbb            216746    215583  0.9946
    b1bse.nz             218981    217753  0.9944
    b1bse.pms            219423    217605  0.9917
    b1bse.paq1           220327    217070  0.9852
    b1bse.p6             223488    220599  0.9871
    b1bse.lzpxj          224591    224487  0.9995
    b1bse.szip           224602    223405  0.9947
    b1bse.bwt.fpaq0f2    225362    223189  0.9904
    b1bse.zpaq_c1        226543    224483  0.9909
    b1bse.bz2            228332    227088  0.9946
    b1bse.ACB            232930    232922  1.0000
    b1bse.dmc            233018    232994  0.9999
    b1bse.tarsalzp       240986    238926  0.9915
    b1bse.bwt.fpaq0p     249549    251258  1.0068
    b1bse.7z             252230    252393  1.0006
    b1bse.cab            255722    257458  1.0068
    b1bse.sr2            266746    266843  1.0004
    b1bse.RAR            283515    284943  1.0050
    b1bse.fcm1           290557    295633  1.0175
    b1bse.HA             290902    292967  1.0071
    b1bse.gz             295220    297465  1.0076
    b1bse.zip            295446    297686  1.0076
    b1bse.lzo            319705    325961  1.0196
    b1bse.Z              320149    322907  1.0086
    b1bse.sr             367564    364787  0.9924
    b1bse.lzrw3-a        370959    405793  1.0939
    b1bse.fpaq0          384986    408745  1.0617
    b1bse.fpaq0f2        386319    392716  1.0166
    b1bse.fpaq0p         388102    401828  1.0354
    b1bse.lzrw5          401756    409700  1.0198
    b1bse.lzrw2          403994    441498  1.0928
    b1bse.fastlz         406263    415559  1.0229
    b1bse.lzrw1-a        417274    457753  1.0970
    b1bse.ppp            422133    463669  1.0984
    b1bse.bpe            427119    409892  0.9597
    b1bse.flzp           440837    447225  1.0145
    b1bse                501598    768771  1.5326
    b1bse.bwt            501602    768775  1.5326
    The second table compares TXTFilter with no preprocessing. First column is book1 compressed with no preprocessing. Second column is preprocessed with TXTFilter. Third column is the ratio. Less than 1 means that TXTFilter improves compression.

    Code:
    book1.nz_cc          189501    203205  1.0723
    book1.paq8px         191784    192477  1.0036
    book1.paq8l          192368    193234  1.0045
    book1.nz             198960    217753  1.0945
    book1.zpaq_c3        199465    204487  1.0252
    book1.paq6           200863    201895  1.0051
    book1.lpaq1          202539    204112  1.0078
    book1.pmm            203247    202338  0.9955
    book1.paq9a          203322    205834  1.0124
    book1.lpaq9m         206201    208317  1.0103
    book1.cmm4           208423    206909  0.9927
    book1.zpaq_c2        208691    207865  0.9960
    book1.paq1           209195    217070  1.0376
    book1.ctw            209514    210213  1.0033
    book1.bbb            213162    215583  1.0114
    book1.grzip          215353    214693  0.9969
    book1.pmd            216021    207685  0.9614
    book1.pms            216233    217605  1.0063
    book1.p6             219434    220599  1.0053
    book1.bwt.fpaq0f2    220338    223189  1.0129
    book1.szip           225639    223405  0.9901
    book1.ACB            225653    232922  1.0322
    book1.zpaq_c1        226461    224483  0.9913
    book1.bz2            232598    227088  0.9763
    book1.lzpxj          237207    224487  0.9464
    book1.dmc            237997    232994  0.9790
    book1.bwt.fpaq0p     244106    251258  1.0293
    book1.tarsalzp       246947    238926  0.9675
    book1.7z             261064    252393  0.9668
    book1.cab            264144    257458  0.9747
    book1.sr2            276364    266843  0.9655
    book1.RAR            301298    284943  0.9457
    book1.gz             312281    297465  0.9526
    book1.HA             312391    292967  0.9378
    book1.zip            312502    297686  0.9526
    book1.Z              335033    322907  0.9638
    book1.fcm1           354460    295633  0.8340
    book1.lzo            360592    325961  0.9040
    book1.sr             370592    364787  0.9843
    book1.lzrw3-a        416135    405793  0.9751
    book1.fpaq0f2        424522    392716  0.9251
    book1.bpe            432845    409892  0.9470
    book1.fpaq0          435227    408745  0.9392
    book1.fpaq0p         442501    401828  0.9081
    book1.lzrw5          444560    409700  0.9216
    book1.fastlz         480393    415559  0.8650
    book1.lzrw2          490239    441498  0.9006
    book1.lzrw1-a        522189    457753  0.8766
    book1.ppp            552507    463669  0.8392
    book1.flzp           566253    447225  0.7898
    book1                768771    768771  1.0000
    book1.bwt            768775    768775  1.0000
    Last edited by Matt Mahoney; 20th June 2010 at 04:27. Reason: added second table

  5. #5
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts
    It occurs to me that if the 'high end' compressor understood the abbreviations used - the dictionary compression was integral to the compressor and not an opaque pre-processing step - you'd have all the advantages - speed, footprint - without any drop in compression?

    I should say I started such an approach at the beginning of the year, as I'm easily distracted by tangents - but Shelwien seemed unimpressed
    Last edited by willvarfar; 20th June 2010 at 12:15.

  6. #6
    Matt Mahoney
    I agree that preprocessing should not have any disadvantage if the compressor is designed to model it. The advantage is time and memory savings because the input is smaller. My plan is to write a zpaq model for enwik9.

  7. #7
    Matt Mahoney
    New version of preprocessor, bse v0.02, is attached. Dictionary codes are restricted to all upper case or all lower case. Faster parsing on pass 2. Fixed bugs in output display. Results of preprocessing on enwik8 are as follows. The first two columns are compressed sizes without and with preprocessing. The last column is the ratio. Less than 1 means compression was improved.

    Code:
                         enwik8    e8bse2
    enwik8.paq8px       18293940  18288674  0.9997
    enwik8.paq8l        18518485  18333599  0.9900
    enwik8.nz_cc        18826931  19070603  1.0129
    enwik8.lpaq9m       19072743  19052273  0.9989
    enwik8.zpaq_c3      19448650  19365333  0.9957
    enwik8.pmm          19701161  18978980  0.9633
    enwik8.lpaq1        19796957  19289133  0.9743
    enwik8.paq9a        20129573  19847576  0.9860
    enwik8.paq6         20303336  20229536  0.9964
    enwik8.cmm4         20548514  19560513  0.9519
    enwik8.zpaq_c2      20941558  19868971  0.9488
    enwik8.nz           20948832  20977258  1.0014
    enwik8.bwt.fpaq0f2  21798843  21901181  1.0047
    enwik8.paq1         22156982  22486123  1.0149
    enwik8.bwt.fpaq0p   23809591  23673942  0.9943
    enwik8.grzip        23846878  23280831  0.9763
    enwik8.bbb          24576921  24068284  0.9793
    enwik8.zpaq_c1      24837469  22204958  0.8940
    enwik8.tarsalzp     25134862  23379736  0.9302
    enwik8.lzpxj        25251404  22537698  0.8925
    enwik8.p6           25377998  24265351  0.9562
    enwik8.ctw          25453025  25437292  0.9994
    enwik8.7z           25895909  24555727  0.9482
    enwik8.szip         26120472  25369554  0.9713
    enwik8.pmd          26275353  24425284  0.9296
    enwik8.pms          26310248  25765583  0.9793
    enwik8.dmc          28402672  27598803  0.9717
    enwik8.cab          28465607  26876133  0.9442
    enwik8.bz2          29008758  27662479  0.9536
    enwik8.sr2          30432506  28119210  0.9240
    enwik8.RAR          35107917  32792401  0.9340
    enwik8.HA           36379137  33765799  0.9282
    enwik8.gz           36445248  34115146  0.9361
    enwik8.zip          36445470  34115373  0.9361
    enwik8.lzo          41217688  36850350  0.8940
    enwik8.sr           43091439  40008906  0.9285
    enwik8.fcm1         45402225  35999143  0.7929
    enwik8.Z            45763941  42692861  0.9329
    enwik8.lzrw3-a      48009194  42381523  0.8828
    enwik8.bpe          53906667  52276686  0.9698
    enwik8.fastlz       54658924  46271463  0.8465
    enwik8.lzrw2        55360907  46252666  0.8355
    enwik8.fpaq0f2      56916872  52746472  0.9267
    enwik8.flzp         57366279  47029921  0.8198
    enwik8.lzrw5        59375192  52884230  0.8907
    enwik8.lzrw1-a      59471657  48235845  0.8111
    enwik8.fpaq0p       61457810  54377872  0.8848
    enwik8.ppp          61657971  49266029  0.7990
    enwik8.fpaq0        63391013  56535499  0.8919
    enwik8             100000000  68200544  0.6820
    enwik8.bwt         100000004  68200548  0.6820
    
    total size of all files
    1,851,111,733 enwik8
    1,669,465,145 bse v0.01
    1,651,469,895 bse v0.02
    bse v0.02 creates the following dictionary. The first column is the 1 byte code. The second column is the number of times it appears in the output file e8bse2.
    Code:
      1   872349 ""  (escape codes)
      2  2407282 ""  (capitalization codes)
      3   529992 "^J"
      4   206250 "^J^J"
      5    32987 "^J  "
      6    52954 "^J    "
      7    77718 "^J      "
      8    22504 "^J        "
      9 12348864 " "
     10   123415 "  "
     11    24744 """
     12    46190 "#"
     13   479245 "&"
     14   191786 "'"
     15   377361 "''"
     16   221072 "("
     17   225748 ")"
     18   235063 "*"
     19   787826 ","
     20   326098 "-"
     21   794548 "."
     22   325049 "/"
     23    51130 "//"
     24   264686 "0"
     25   272564 "1"
     26    25732 "15"
     27   126841 "19"
     28   210974 "2"
     29    55723 "200"
     30   156663 "3"
     31   145926 "4"
     32   132176 "5"
     33   134227 "6"
     34   121640 "7"
     35   146017 "8"
     36   120162 "9"
     37   333366 ":"
     38   572842 ";"
     39   267136 "<"
     40   146194 "="
     41   127359 "=="
     42   267136 ">"
     43   158976 "A"
     44   179270 "B"
     45   134414 "C"
     46   112428 "D"
     47   113149 "E"
     48    31083 "H"
     49   139620 "I"
     50    76029 "J"
     51    27994 "L"
     52    55842 "M"
     53    63731 "P"
     54    48162 "R"
     55   134567 "S"
     56    78070 "T"
     57    67247 "W"
     58    65444 "["
     59   945578 "[["
     60    64457 "]"
     61   945639 "]]"
     62    49420 "_"
     63   795794 "a"
     64    77166 "ab"
     65   184894 "ac"
     66   435786 "al"
     67   124627 "am"
     68    80305 "amp"
     69   383520 "an"
     70   365315 "and"
     71   344828 "ar"
     72    87898 "are"
     73   264915 "as"
     74   163026 "at"
     75    61629 "ate"
     76    24841 "ati"
     77    60326 "ation"
     78    35023 "av"
     79    67015 "ay"
     80   625275 "b"
     81   268728 "be"
     82    37520 "ble"
     83   539501 "c"
     84   171616 "ca"
     85    53559 "cal"
     86    44001 "can"
     87   223048 "ce"
     88    46383 "ces"
     89   299653 "ch"
     90   123381 "ci"
     91   151144 "com"
     92   131420 "con"
     93    24892 "contributor"
     94    68248 "cu"
     95   897157 "d"
     96   298392 "de"
     97    72450 "der"
     98   247116 "di"
     99   972786 "e"
    100    97147 "ec"
    101   208873 "ed"
    102   114597 "el"
    103   204568 "en"
    104    81662 "ent"
    105   292875 "er"
    106   241183 "es"
    107    46067 "eve"
    108   649262 "f"
    109   175350 "for"
    110    73707 "fr"
    111    57623 "from"
    112   748574 "g"
    113   232812 "ge"
    114    56318 "ght"
    115   113753 "gt"
    116   247087 "h"
    117   202461 "ha"
    118   154768 "he"
    119    52090 "her"
    120   161047 "hi"
    121    69385 "his"
    122   150909 "ho"
    123    23108 "ht"
    124    50893 "http"
    125   714782 "i"
    126    98226 "ia"
    127   131675 "ic"
    128    70891 "ica"
    129   140407 "id"
    130   136833 "il"
    131   717862 "in"
    132   137690 "ing"
    133   108062 "ir"
    134    23202 "irst"
    135   298712 "is"
    136    20056 "ism"
    137    45566 "ist"
    138    35872 "ive"
    139    83366 "j"
    140   294335 "k"
    141    97227 "ke"
    142    37954 "ki"
    143    29476 "king"
    144   525368 "l"
    145   291588 "la"
    146    60052 "ld"
    147   351333 "le"
    148   298969 "li"
    149   164507 "lo"
    150   153500 "lt"
    151   184192 "ly"
    152   381465 "m"
    153   262662 "ma"
    154    90050 "man"
    155   226987 "me"
    156    97125 "ment"
    157    47600 "min"
    158   180223 "mo"
    159    66732 "mp"
    160    38858 "ms"
    161    62013 "mu"
    162  1147041 "n"
    163   118040 "na"
    164    39615 "name"
    165   225693 "ne"
    166   232914 "ng"
    167    56003 "not"
    168   616737 "o"
    169    80620 "od"
    170   526739 "of"
    171   150222 "ol"
    172    68842 "om"
    173   370094 "on"
    174    86087 "op"
    175   281609 "or"
    176    29080 "org"
    177    39182 "ory"
    178    67684 "ot"
    179    38405 "othe"
    180   177339 "ou"
    181    31613 "our"
    182    52735 "ow"
    183   423610 "p"
    184   143452 "pa"
    185    30773 "page"
    186    99444 "pe"
    187    59001 "per"
    188    96532 "pl"
    189   158512 "po"
    190   172239 "pr"
    191    46793 "pres"
    192    66977 "qu"
    193   193734 "quot"
    194   802236 "r"
    195   295225 "ra"
    196   464852 "re"
    197    27014 "ref"
    198    25226 "revision"
    199   272421 "ri"
    200   167641 "ro"
    201    41390 "rou"
    202  1525191 "s"
    203   304437 "se"
    204   279030 "si"
    205    28393 "sm"
    206   204471 "so"
    207   121494 "sp"
    208   359791 "st"
    209    29898 "state"
    210    49351 "sti"
    211   155052 "su"
    212   857894 "t"
    213   141217 "ta"
    214    33241 "tal"
    215   192381 "te"
    216    87147 "ted"
    217    39500 "ten"
    218   172734 "ter"
    219    33365 "text"
    220   275444 "th"
    221    85809 "that"
    222  1004336 "the"
    223   199124 "ti"
    224    70799 "tic"
    225    24719 "timestamp"
    226   204673 "tion"
    227    20493 "tis"
    228    32048 "title"
    229   424196 "to"
    230   115490 "ts"
    231    49596 "tt"
    232   104923 "ty"
    233   751131 "u"
    234    21579 "ual"
    235    58162 "uc"
    236   154576 "un"
    237   130518 "ur"
    238    50542 "ure"
    239    83858 "use"
    240    24256 "uti"
    241   154204 "v"
    242   202092 "ve"
    243   101256 "ver"
    244   139902 "vi"
    245   487765 "w"
    246    86849 "was"
    247    82033 "wi"
    248    85862 "with"
    249    52007 "wor"
    250    39476 "www"
    251   200850 "x"
    252   693669 "y"
    253    39763 "{{"
    254   453312 "|"
    255    40238 "}}"
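    To illustrate how a table like the one above maps strings to one-byte codes, here is a minimal sketch. It assumes a greedy longest-match parse and uses only a handful of the entries listed above; the actual parsing rule used by bse is not stated in this thread, and the function names are made up for the example.

```python
# Hypothetical subset of the dictionary above: code -> string.
# Greedy longest match is an assumption, not necessarily what bse does.
CODES = {168: "o", 170: "of", 212: "t", 220: "th", 222: "the", 229: "to"}

STRINGS = {s: c for c, s in CODES.items()}
MAX_LEN = max(len(s) for s in STRINGS)


def encode(text: str) -> bytes:
    """Replace each longest matching dictionary string with its one-byte code."""
    out = bytearray()
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(MAX_LEN, len(text) - i), 0, -1):
            code = STRINGS.get(text[i:i + n])
            if code is not None:
                out.append(code)
                i += n
                break
        else:
            # This toy subset has no escape mechanism for unknown symbols.
            raise ValueError(f"no dictionary entry covers {text[i]!r}")
    return bytes(out)


def decode(data: bytes) -> str:
    """Expand each code back to its dictionary string."""
    return "".join(CODES[b] for b in data)
```

    With this subset, "tothe" becomes the two bytes 229 ("to") and 222 ("the"), and decoding restores the original text.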

  8. #8
    Member
    Join Date
    Aug 2009
    Location
    Bari
    Posts
    74
    Thanks
    1
    Thanked 1 Time in 1 Post

    GREAT!

    Wow, fantastic. In my test with 7z:

    7z without preprocessor: 24.828.985 bytes

    7z with bse001: 24.065.518 bytes

    24.065.518/24.828.985 = 0,969

    7z with bse002: 23.895.701 bytes

    23.895.701/24.828.985 = 0,962

    24.828.985 - 23.895.701 = 933.284 bytes, about 1 MB!
    Last edited by PiPPoNe92; 24th June 2010 at 01:36.

  9. #9
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Here is a test using drt (dictionary encoding) from lpaq9m by Alexander Ratushnyak. I prepared the test file:

    lpaq9m d lpqdict0.enc lpqdict0.dic (84,603 -> 465,210)
    drt enwik8 enwik8.drt (100,000,000 -> 60,824,424)
    copy/b lpqdict0.dic+enwik8.drt en8drt (61,289,634 bytes)

    I prepended the dictionary so that the test file contains everything needed to decompress it. Preprocessing this way improves compression with every compressor tested.
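    For anyone not on Windows, the `copy/b` step above is a plain binary concatenation. A minimal sketch of the same operation follows; the helper name `concat` is made up for this example.

```python
import shutil


def concat(parts, dest):
    """Binary-concatenate the files in `parts` into `dest`, like `copy /b a+b dest`.

    Returns the size of the resulting file in bytes.
    """
    with open(dest, "wb") as out:
        for path in parts:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
        return out.tell()
```

    Calling `concat(["lpqdict0.dic", "enwik8.drt"], "en8drt")` should return 465,210 + 60,824,424 = 61,289,634, matching the size reported above.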

    Code:
                         enwik8    en8drt
    enwik8.paq8px       18293940  17342041  0.9480
    enwik8.paq8l        18518485  17560378  0.9483
    enwik8.nz_cc        18826931  18633832  0.9897
    enwik8.lpaq9m       19072743  18077356  0.9478
    enwik8.zpaq_c3      19448650  18928856  0.9733
    enwik8.pmm          19701161  18650601  0.9467
    enwik8.lpaq1        19796957  18905483  0.9550
    enwik8.paq9a        20129573  19374291  0.9625
    enwik8.paq6         20303336  19439547  0.9575
    enwik8.cmm4         20548514  19133313  0.9311
    enwik8.zpaq_c2      20941558  19447733  0.9287
    enwik8.nz           20948832  20588807  0.9828
    enwik8.bwt.fpaq0f2  21798843  21406906  0.9820
    enwik8.paq1         22156982  21437426  0.9675
    enwik8.bwt.fpaq0p   23809591  22855730  0.9599
    enwik8.grzip        23846878  22379326  0.9385
    enwik8.bbb          24576921  22701384  0.9237
    enwik8.zpaq_c1      24837469  21559014  0.8680
    enwik8.tarsalzp     25134862  22773386  0.9060
    enwik8.lzpxj        25251404  21877402  0.8664
    enwik8.p6           25377998  23078246  0.9094
    enwik8.ctw          25453025  24454785  0.9608
    enwik8.7z           25895909  23487746  0.9070
    enwik8.szip         26120472  24045552  0.9206
    enwik8.pmd          26275353  23448205  0.8924
    enwik8.pms          26310248  23824677  0.9055
    enwik8.dmc          28402672  25532850  0.8990
    enwik8.cab          28465607  25963613  0.9121
    enwik8.bz2          29008758  25612712  0.8829
    enwik8.sr2          30432506  26328768  0.8652
    enwik8.RAR          35107917  30132497  0.8583
    enwik8.HA           36379137  30633820  0.8421
    enwik8.gz           36445248  30902821  0.8479
    enwik8.zip          36445470  30903043  0.8479
    enwik8.lzo          41217688  33358696  0.8093
    enwik8.sr           43091439  38492535  0.8933
    enwik8.fcm1         45402225  29581661  0.6515
    enwik8.Z            45763941  37478724  0.8190
    enwik8.lzrw3-a      48009194  38635335  0.8047
    enwik8.bpe          53906667  41403271  0.7681
    enwik8.fastlz       54658924  42337322  0.7746
    enwik8.lzrw2        55360907  41854974  0.7560
    enwik8.fpaq0f2      56916872  40415334  0.7101
    enwik8.flzp         57366279  43944882  0.7660
    enwik8.lzrw5        59375192  46019812  0.7751
    enwik8.lzrw1-a      59471657  43184084  0.7261
    enwik8.fpaq0p       61457810  44979267  0.7319
    enwik8.ppp          61657971  44103741  0.7153
    enwik8.fpaq0        63391013  47589951  0.7507
    enwik8             100000000  61289634  0.6129
    enwik8.bwt         100000004  61289638  0.6129
    Furthermore, the result is better than bse v0.02 in every case. So it looks like I will abandon this approach and use a larger dictionary. drt includes a dictionary tuned for enwik9. My plan is to generalize this technique to other text files.

    Code:
                         e8bse2    en8drt
    e8bse2.paq8px       18288674  17342041  0.9482
    e8bse2.paq8l        18333599  17560378  0.9578
    e8bse2.pmm          18978980  18650601  0.9827
    e8bse2.lpaq9m       19052273  18077356  0.9488
    e8bse2.nz_cc        19070603  18633832  0.9771
    e8bse2.lpaq1        19289133  18905483  0.9801
    e8bse2.zpaq_c3      19365333  18928856  0.9775
    e8bse2.cmm4         19560513  19133313  0.9782
    e8bse2.paq9a        19847576  19374291  0.9762
    e8bse2.zpaq_c2      19868971  19447733  0.9788
    e8bse2.paq6         20229536  19439547  0.9609
    e8bse2.nz           20977258  20588807  0.9815
    e8bse2.bwt.fpaq0f2  21901181  21406906  0.9774
    e8bse2.zpaq_c1      22204958  21559014  0.9709
    e8bse2.paq1         22486123  21437426  0.9534
    e8bse2.lzpxj        22537698  21877402  0.9707
    e8bse2.grzip        23280831  22379326  0.9613
    e8bse2.tarsalzp     23379736  22773386  0.9741
    e8bse2.bwt.fpaq0p   23673942  22855730  0.9654
    e8bse2.bbb          24068284  22701384  0.9432
    e8bse2.p6           24265351  23078246  0.9511
    e8bse2.pmd          24425284  23448205  0.9600
    e8bse2.7z           24555727  23487746  0.9565
    e8bse2.szip         25369554  24045552  0.9478
    e8bse2.ctw          25437292  24454785  0.9614
    e8bse2.pms          25765583  23824677  0.9247
    e8bse2.cab          26876133  25963613  0.9660
    e8bse2.dmc          27598803  25532850  0.9251
    e8bse2.bz2          27662479  25612712  0.9259
    e8bse2.sr2          28119210  26328768  0.9363
    e8bse2.RAR          32792401  30132497  0.9189
    e8bse2.HA           33765799  30633820  0.9072
    e8bse2.gz           34115146  30902821  0.9058
    e8bse2.zip          34115373  30903043  0.9058
    e8bse2.fcm1         35999143  29581661  0.8217
    e8bse2.lzo          36850350  33358696  0.9052
    e8bse2.sr           40008906  38492535  0.9621
    e8bse2.lzrw3-a      42381523  38635335  0.9116
    e8bse2.Z            42692861  37478724  0.8779
    e8bse2.lzrw2        46252666  41854974  0.9049
    e8bse2.fastlz       46271463  42337322  0.9150
    e8bse2.flzp         47029921  43944882  0.9344
    e8bse2.lzrw1-a      48235845  43184084  0.8953
    e8bse2.ppp          49266029  44103741  0.8952
    e8bse2.bpe          52276686  41403271  0.7920
    e8bse2.fpaq0f2      52746472  40415334  0.7662
    e8bse2.lzrw5        52884230  46019812  0.8702
    e8bse2.fpaq0p       54377872  44979267  0.8272
    e8bse2.fpaq0        56535499  47589951  0.8418
    e8bse2              68200544  61289634  0.8987
    e8bse2.bwt          68200548  61289638  0.8987
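    The ratio column and the "better in every case" claim can be rechecked from the two size columns. A minimal sketch, using four rows copied from the table above (the helper name `ratios` is made up for this example):

```python
# Four rows copied from the e8bse2 / en8drt table: name, bse size, drt size.
ROWS = """\
e8bse2.paq8px 18288674 17342041
e8bse2.7z 24555727 23487746
e8bse2.bz2 27662479 25612712
e8bse2.fpaq0 56535499 47589951
"""


def ratios(text):
    """Return {name: second_size / first_size}, rounded to 4 places."""
    result = {}
    for line in text.splitlines():
        name, before, after = line.split()
        result[name] = round(int(after) / int(before), 4)
    return result


table = ratios(ROWS)
# A ratio below 1 means the second column (drt) is smaller.
drt_always_wins = all(r < 1 for r in table.values())
```

    Pasting the full table into `ROWS` extends the check to every compressor listed.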

  10. #10
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,502
    Thanks
    741
    Thanked 664 Times in 358 Posts

  11. #11
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 172 Times in 64 Posts
    Also look at http://xwrt.sourceforge.net/

    XWRT also works with text files (not only XML) and creates a semi-dynamic dictionary in a first pass, which is stored at the beginning of the output file (in XWRT 3.1+ the dictionary is compressed with prefix compression, but you can try XWRT 3.0).
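    To make the two-pass idea concrete, here is a minimal sketch of a semi-dynamic dictionary: the first pass collects frequent words from the input itself, and the second pass replaces them with indices, with the word list shipped alongside the output. This is not XWRT's code or file format; the thresholds, token layout, and function names are assumptions made for illustration.

```python
import re
from collections import Counter

WORD = re.compile(r"[A-Za-z]+")


def build_dictionary(text, min_count=2, min_len=3):
    """Pass 1: pick words that occur often enough to be worth a dictionary slot."""
    counts = Counter(WORD.findall(text))
    words = [w for w, c in counts.items() if c >= min_count and len(w) >= min_len]
    # Most frequent first; ties broken alphabetically so the result is deterministic.
    words.sort(key=lambda w: (-counts[w], w))
    return words


def encode(text, words):
    """Pass 2: replace dictionary words with integer indices, keep the rest as strings."""
    index = {w: i for i, w in enumerate(words)}
    tokens, pos = [], 0
    for m in WORD.finditer(text):
        if m.group() in index:
            if m.start() > pos:
                tokens.append(text[pos:m.start()])
            tokens.append(index[m.group()])
            pos = m.end()
    if pos < len(text):
        tokens.append(text[pos:])
    return tokens


def decode(tokens, words):
    """Rebuild the text from the word list and the token stream."""
    return "".join(words[t] if isinstance(t, int) else t for t in tokens)
```

    Because the decoder needs only `words` and `tokens`, storing the word list at the start of the output makes the file self-contained, which is the property described above.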
    Last edited by inikep; 27th June 2010 at 11:16.

  12. #12
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,502
    Thanks
    741
    Thanked 664 Times in 358 Posts
    btw, encoding is better in xwrt (at least when combined with ppmd), while dictionary building is probably better in dict (it's definitely faster and more flexible, but I'm not sure it's as useful for final size reduction as xwrt's)

  13. #13
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Quick test using dict with no options shows compression is not as good as drt. A few more programs are still running. I will also compare with -p. Also I need to try xwrt, which has lots of options. I believe that drt is derived from xwrt and tuned to LTCB.

    Code:
                         en8drt    e8dict
    en8drt.lpaq9m       18077356  19586005  1.0835
    en8drt.nz_cc        18633832  19653421  1.0547
    en8drt.pmm          18650601  19612228  1.0516
    en8drt.lpaq1        18905483  19868184  1.0509
    en8drt.zpaq_c3      18928856  20145926  1.0643
    en8drt.cmm4         19133313  19950206  1.0427
    en8drt.paq9a        19374291  20458736  1.0560
    en8drt.zpaq_c2      19447733  20575637  1.0580
    en8drt.nz           20588807  21476462  1.0431
    en8drt.bwt.fpaq0f2  21406906  22516831  1.0518
    en8drt.zpaq_c1      21559014  22656275  1.0509
    en8drt.lzpxj        21877402  22491983  1.0281
    en8drt.grzip        22379326  23692052  1.0587
    en8drt.tarsalzp     22773386  23858310  1.0476
    en8drt.bwt.fpaq0p   22855730  24445366  1.0696
    en8drt.p6           23078246  25557370  1.1074
    en8drt.pmd          23448205  25038178  1.0678
    en8drt.7z           23487746  24238277  1.0320
    en8drt.pms          23824677  26179265  1.0988
    en8drt.szip         24045552  25693053  1.0685
    en8drt.ctw          24454785  26326344  1.0765
    en8drt.ACB          24615204  25771201  1.0470
    en8drt.dmc          25532850  27788469  1.0883
    en8drt.bz2          25612712  27708681  1.0818
    en8drt.cab          25963613  26629029  1.0256
    en8drt.sr2          26328768  27975476  1.0625
    en8drt.fcm1         29581661  31153468  1.0531
    en8drt.RAR          30132497  30859169  1.0241
    en8drt.HA           30633820  31304270  1.0219
    en8drt.gz           30902821  31587299  1.0221
    en8drt.zip          30903043  31587521  1.0221
    en8drt.lzo          33358696  33773272  1.0124
    en8drt.Z            37478724  38985285  1.0402
    en8drt.sr           38492535  38067177  0.9889
    en8drt.lzrw3-a      38635335  38289232  0.9910
    en8drt.fpaq0f2      40415334  40374274  0.9990
    en8drt.bpe          41403271  42370376  1.0234
    en8drt.lzrw2        41854974  40623041  0.9706
    en8drt.fastlz       42337322  40764748  0.9629
    en8drt.lzrw1-a      43184084  41527158  0.9616
    en8drt.flzp         43944882  42991744  0.9783
    en8drt.ppp          44103741  42958346  0.9740
    en8drt.fpaq0p       44979267  41399294  0.9204
    en8drt.lzrw5        46019812  46170702  1.0033
    en8drt.fpaq0        47589951  42606093  0.8953
    en8drt              61289634  52812712  0.8617
    en8drt.bwt          61289638  52812716  0.8617

  14. #14
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Following is a comparison of dict with dict -p. The -p option is supposed to improve compression with ppmd and ppmonstr. In fact it helps with most of the best compressors.

    Code:
                         e8dict    e8dicp
    e8dict.paq8px       18964164  18690653  0.9856
    e8dict.paq8l        19022328  18748162  0.9856
    e8dict.lpaq9m       19586005  19338820  0.9874
    e8dict.pmm          19612228  19305080  0.9843
    e8dict.nz_cc        19653421  19417192  0.9880
    e8dict.lpaq1        19868184  19637405  0.9884
    e8dict.cmm4         19950206  19728355  0.9889
    e8dict.zpaq_c3      20145926  19865477  0.9861
    e8dict.paq9a        20458736  20220415  0.9884
    e8dict.zpaq_c2      20575637  20286842  0.9860
    e8dict.paq6         21018203  20708010  0.9852
    e8dict.nz           21476462  23967683  1.1160
    e8dict.lzpxj        22491983  22262593  0.9898
    e8dict.bwt.fpaq0f2  22516831  22316399  0.9911
    e8dict.zpaq_c1      22656275  22387187  0.9881
    e8dict.paq1         23102888  22788030  0.9864
    e8dict.grzip        23692052  23518030  0.9927
    e8dict.tarsalzp     23858310  23624791  0.9902
    e8dict.7z           24238277  24046347  0.9921
    e8dict.bbb          24388541  24243951  0.9941
    e8dict.bwt.fpaq0p   24445366  24224987  0.9910
    e8dict.pmd          25038178  24802373  0.9906
    e8dict.p6           25557370  25325516  0.9909
    e8dict.szip         25693053  25475522  0.9915
    e8dict.ACB          25771201  25525001  0.9904
    e8dict.pms          26179265  26096269  0.9968
    e8dict.ctw          26326344  26049688  0.9895
    e8dict.cab          26629029  26390295  0.9910
    e8dict.bz2          27708681  27628045  0.9971
    e8dict.dmc          27788469  27779601  0.9997
    e8dict.sr2          27975476  27847652  0.9954
    e8dict.RAR          30859169  31034171  1.0057
    e8dict.fcm1         31153468  31536206  1.0123
    e8dict.HA           31304270  31532959  1.0073
    e8dict.gz           31587299  31802042  1.0068
    e8dict.zip          31587521  31802264  1.0068
    e8dict.lzo          33773272  34138178  1.0108
    e8dict.sr           38067177  38370286  1.0080
    e8dict.lzrw3-a      38289232  38725881  1.0114
    e8dict.Z            38985285  39307770  1.0083
    e8dict.fpaq0f2      40374274  41386227  1.0251
    e8dict.lzrw2        40623041  41186858  1.0139
    e8dict.fastlz       40764748  41215108  1.0110
    e8dict.fpaq0p       41399294  42379089  1.0237
    e8dict.lzrw1-a      41527158  42136200  1.0147
    e8dict.bpe          42370376  42757349  1.0091
    e8dict.fpaq0        42606093  43678804  1.0252
    e8dict.ppp          42958346  43691008  1.0171
    e8dict.flzp         42991744  43158193  1.0039
    e8dict.lzrw5        46170702  46736172  1.0122
    e8dict              52812712  53897083  1.0205
    e8dict.bwt          52812716  53897087  1.0205
    However, dict -p is still worse than drt for most compressors.

    Code:
                         en8drt    e8dicp
    en8drt.paq8px       17342041  18690653  1.0778
    en8drt.paq8l        17560378  18748162  1.0676
    en8drt.lpaq9m       18077356  19338820  1.0698
    en8drt.nz_cc        18633832  19417192  1.0420
    en8drt.pmm          18650601  19305080  1.0351
    en8drt.lpaq1        18905483  19637405  1.0387
    en8drt.zpaq_c3      18928856  19865477  1.0495
    en8drt.cmm4         19133313  19728355  1.0311
    en8drt.paq9a        19374291  20220415  1.0437
    en8drt.paq6         19439547  20708010  1.0653
    en8drt.zpaq_c2      19447733  20286842  1.0431
    en8drt.nz           20588807  23967683  1.1641
    en8drt.bwt.fpaq0f2  21406906  22316399  1.0425
    en8drt.paq1         21437426  22788030  1.0630
    en8drt.zpaq_c1      21559014  22387187  1.0384
    en8drt.lzpxj        21877402  22262593  1.0176
    en8drt.grzip        22379326  23518030  1.0509
    en8drt.bbb          22701384  24243951  1.0680
    en8drt.tarsalzp     22773386  23624791  1.0374
    en8drt.bwt.fpaq0p   22855730  24224987  1.0599
    en8drt.p6           23078246  25325516  1.0974
    en8drt.pmd          23448205  24802373  1.0578
    en8drt.7z           23487746  24046347  1.0238
    en8drt.pms          23824677  26096269  1.0953
    en8drt.szip         24045552  25475522  1.0595
    en8drt.ctw          24454785  26049688  1.0652
    en8drt.ACB          24615204  25525001  1.0370
    en8drt.dmc          25532850  27779601  1.0880
    en8drt.bz2          25612712  27628045  1.0787
    en8drt.cab          25963613  26390295  1.0164
    en8drt.sr2          26328768  27847652  1.0577
    en8drt.fcm1         29581661  31536206  1.0661
    en8drt.RAR          30132497  31034171  1.0299
    en8drt.HA           30633820  31532959  1.0294
    en8drt.gz           30902821  31802042  1.0291
    en8drt.zip          30903043  31802264  1.0291
    en8drt.lzo          33358696  34138178  1.0234
    en8drt.Z            37478724  39307770  1.0488
    en8drt.sr           38492535  38370286  0.9968
    en8drt.lzrw3-a      38635335  38725881  1.0023
    en8drt.fpaq0f2      40415334  41386227  1.0240
    en8drt.bpe          41403271  42757349  1.0327
    en8drt.lzrw2        41854974  41186858  0.9840
    en8drt.fastlz       42337322  41215108  0.9735
    en8drt.lzrw1-a      43184084  42136200  0.9757
    en8drt.flzp         43944882  43158193  0.9821
    en8drt.ppp          44103741  43691008  0.9906
    en8drt.fpaq0p       44979267  42379089  0.9422
    en8drt.lzrw5        46019812  46736172  1.0156
    en8drt.fpaq0        47589951  43678804  0.9178
    en8drt              61289634  53897083  0.8794
    en8drt.bwt          61289638  53897087  0.8794
    In many cases, dict -p is still worse than no preprocessing.

    Code:
                         enwik8    e8dicp
    enwik8.paq8px       18293940  18690653  1.0217
    enwik8.paq8l        18518485  18748162  1.0124
    enwik8.nz_cc        18826931  19417192  1.0314
    enwik8.lpaq9m       19072743  19338820  1.0140
    enwik8.zpaq_c3      19448650  19865477  1.0214
    enwik8.pmm          19701161  19305080  0.9799
    enwik8.lpaq1        19796957  19637405  0.9919
    enwik8.paq9a        20129573  20220415  1.0045
    enwik8.paq6         20303336  20708010  1.0199
    enwik8.cmm4         20548514  19728355  0.9601
    enwik8.zpaq_c2      20941558  20286842  0.9687
    enwik8.nz           20948832  23967683  1.1441
    enwik8.bwt.fpaq0f2  21798843  22316399  1.0237
    enwik8.paq1         22156982  22788030  1.0285
    enwik8.bwt.fpaq0p   23809591  24224987  1.0174
    enwik8.grzip        23846878  23518030  0.9862
    enwik8.bbb          24576921  24243951  0.9865
    enwik8.zpaq_c1      24837469  22387187  0.9013
    enwik8.tarsalzp     25134862  23624791  0.9399
    enwik8.lzpxj        25251404  22262593  0.8816
    enwik8.p6           25377998  25325516  0.9979
    enwik8.ctw          25453025  26049688  1.0234
    enwik8.7z           25895909  24046347  0.9286
    enwik8.szip         26120472  25475522  0.9753
    enwik8.pmd          26275353  24802373  0.9439
    enwik8.pms          26310248  26096269  0.9919
    enwik8.dmc          28402672  27779601  0.9781
    enwik8.cab          28465607  26390295  0.9271
    enwik8.bz2          29008758  27628045  0.9524
    enwik8.sr2          30432506  27847652  0.9151
    enwik8.RAR          35107917  31034171  0.8840
    enwik8.HA           36379137  31532959  0.8668
    enwik8.gz           36445248  31802042  0.8726
    enwik8.zip          36445470  31802264  0.8726
    enwik8.lzo          41217688  34138178  0.8282
    enwik8.sr           43091439  38370286  0.8904
    enwik8.fcm1         45402225  31536206  0.6946
    enwik8.Z            45763941  39307770  0.8589
    enwik8.lzrw3-a      48009194  38725881  0.8066
    enwik8.bpe          53906667  42757349  0.7932
    enwik8.fastlz       54658924  41215108  0.7540
    enwik8.lzrw2        55360907  41186858  0.7440
    enwik8.fpaq0f2      56916872  41386227  0.7271
    enwik8.flzp         57366279  43158193  0.7523
    enwik8.lzrw5        59375192  46736172  0.7871
    enwik8.lzrw1-a      59471657  42136200  0.7085
    enwik8.fpaq0p       61457810  42379089  0.6896
    enwik8.ppp          61657971  43691008  0.7086
    enwik8.fpaq0        63391013  43678804  0.6890
    enwik8             100000000  53897083  0.5390
    enwik8.bwt         100000004  53897087  0.5390
