
Thread: Large files, known format, static dictionary, asymmetric compression [Seeking Advice]

  1. #1
    Member
    Join Date
    Aug 2011
    Location
    California, USA
    Posts
    15
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Large files, known format, static dictionary, asymmetric compression [Seeking Advice]

    Greetings all, this is my first post here.

    I have large-ish files (1GB-20GB) containing data from scientific equipment. The data are "confidence scores" from the equipment, 1 byte per score, and they tend to cluster around a few values (see below).

    Here are my constraints for the project:

    • Compression can be resource-heavy and multi-threaded, but decompression must be lightweight and fast (i.e., I need an asymmetric algorithm)
    • I can carry a "reasonably" sized static dictionary alongside the compressor/decompressor (say 1GB)
    • I can include libraries that are under any license except the GPLs

    In case you are more interested in the particulars of the data format, I'm basing experiments on an 8.6GB file; here is a histogram of the values from my sample file:
    Code:
    00:      1670541      01:      3747154      02:     36071659      03:     92542478
    04:     81758287      05:     28633135      06:     28057031      07:     26724964
    08:     22940418      09:     27044628      0a:     30942826      0b:     31110416
    0c:     34169154      0d:     31430615      0e:     31219345      0f:     32424171
    10:     37500184      11:     39805532      12:     38590627      13:     42697505
    14:     44072307      15:     48421933      16:     59579526      17:     86290873
    18:    108580923      19:    127009325      1a:    121578878      1b:    165814677
    1c:    228058142      1d:    340066910      1e:    670717321      1f:   1164031879
    20:   2182807077      21:   1753181866      22:    730086113      23:    105239027
    24:      5665306      25:      1158446      26:       484824      27:       162003
    28:        99947
    Values not listed above do not occur in the sample.

    Plain Huffman coding compresses to a ratio of 0.44, but I'm trying to do much better. I've tried SO many things without finding anything satisfying: gzip gets 0.46, bz2 gets 0.42, and lzma -7 gets 0.40, so nothing dramatically beats Huffman codes. (The symmetric zpaq -m1 gets 0.39.) None of these leverages my ability to use a static dictionary.
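
    A quick sanity check: the order-0 (Shannon) entropy can be computed directly from the histogram above. A minimal Python sketch using the counts as posted; it should come out around 3.5 bits/byte, i.e. roughly the 0.44 ratio that Huffman reaches.
    Code:
    # Order-0 (Shannon) entropy estimate from the posted histogram.
    import math

    counts = [
          1670541,    3747154,   36071659,   92542478,   81758287,   28633135,
         28057031,   26724964,   22940418,   27044628,   30942826,   31110416,
         34169154,   31430615,   31219345,   32424171,   37500184,   39805532,
         38590627,   42697505,   44072307,   48421933,   59579526,   86290873,
        108580923,  127009325,  121578878,  165814677,  228058142,  340066910,
        670717321, 1164031879, 2182807077, 1753181866,  730086113,  105239027,
          5665306,    1158446,     484824,     162003,      99947,
    ]

    total = sum(counts)
    entropy = -sum(c / total * math.log2(c / total) for c in counts if c)
    print("order-0 entropy: %.3f bits/byte -> ratio %.3f" % (entropy, entropy / 8))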

    Questions:

    1. Does the poor result on LZ, LZMA, BW mean there just isn't enough "pattern-based" redundancy to extract and I'm doomed to live with the Shannon bound?
    2. Is there a compression framework out there that lets me train a model with a few hundred GB of data and then produce a static dictionary to help improve compression beyond what Huffman gives?

    Any advice is greatly appreciated!
    Last edited by Fixee; 26th August 2011 at 03:01.

  2. #2
    Member
    Join Date
    Jul 2011
    Location
    phils
    Posts
    32
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Those stats above mean nothing to any dictionary-based
    compression, but of course they do to entropy coders.

    Look at your data for any other patterns...

    If each byte depends on the previous byte, that is, their
    difference is small, then try Bulat's filters and see if
    compression improves.

    EDIT... also you might want to try a simple transform, where
    each byte of the output is the combination of the two bytes
    from the input, where the output's nibbles are from the
    low bits of the input (since all data are less than 2

    And apply LZMA.

    If LZMA is at 0.40 now, I believe it will improve.
    Last edited by incblitz; 25th August 2011 at 09:03.

  3. #3
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,507
    Thanks
    742
    Thanked 665 Times in 359 Posts
    a huffman/arithmetic coder fully utilizes the stats you provided. higher-level methods usually find correlations between various parts of the data and then encode these correlations with the same huf/ari

    all packers you have tried provided results not better than simple huf - this means that their higher-level models were inappropriate for your data. you may try other packers. even if some better packer is too slow in decoding, it will prove that better compression is possible and show the character of the data correlations

    in order to find a better packer for your data, one needs a data sample

  4. #4
    Member
    Join Date
    Aug 2011
    Location
    California, USA
    Posts
    15
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Thanks for the reply incblitz!

    Quote Originally Posted by incblitz View Post
    Look at your data for any other patterns...
    Yeah, I've been searching for patterns by hand, and nothing's really appearing. I am hoping to use some kind of automated search to assist me here.

    EDIT... also you might want to try a simple transform, where
    each byte of the output is the combination of the two bytes
    from the input, where the output's nibbles are from the
    low bits of the input (since all data are less than 2

    And apply LZMA.

    If LZMA is at 0.40 now, I believe it will improve.
    I'm confused by this suggestion. It sounds like you are suggesting I pack two bytes into one, but since each input byte ranges from 0x0-0x28, you cannot do this and have it be invertible. In order to pack two bytes into one, we need the range to be restricted to 0x0-0x0f, right?

    If you are suggesting a different transform, please correct me. I can certainly transform each pair of bytes a, b into a*0x29 + b, but this will be 12 bits.

    I can pack 4 input values into 3 bytes, if you think that would help LZMA?!
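
    For reference, the 4-values-into-3-bytes packing is straightforward because every value fits in 6 bits (0x28 = 40 < 64). A minimal sketch (input length assumed to be a multiple of 4):
    Code:
    # Pack 4 six-bit values (each 0x00..0x28) into 3 bytes, and unpack again.
    def pack4to3(vals):
        out = bytearray()
        for a, b, c, d in zip(*[iter(vals)] * 4):       # consecutive groups of 4
            x = (a << 18) | (b << 12) | (c << 6) | d    # 24 bits
            out += bytes(((x >> 16) & 0xFF, (x >> 8) & 0xFF, x & 0xFF))
        return bytes(out)

    def unpack3to4(data):
        out = bytearray()
        for i in range(0, len(data), 3):
            x = (data[i] << 16) | (data[i + 1] << 8) | data[i + 2]
            out += bytes(((x >> 18) & 0x3F, (x >> 12) & 0x3F, (x >> 6) & 0x3F, x & 0x3F))
        return bytes(out)

    assert unpack3to4(pack4to3(b"\x00\x28\x1f\x20")) == b"\x00\x28\x1f\x20"
    (As noted in the next reply, an entropy coder already reclaims those two unused bits, so this mostly matters, if at all, for LZ match finding.)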

  5. #5
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,507
    Thanks
    742
    Thanked 665 Times in 359 Posts
    I can pack 4 input values into 3 bytes, if you think that would help LZMA?!
    it shouldn't help. huf is enough to squeeze out these 2 unused bits

  6. #6
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,507
    Thanks
    742
    Thanked 665 Times in 359 Posts
    usually the best way to find a compressor is to learn the data source. for example, text packers aren't developed by trying every method possible but by learning the properties of natural language

    so you need to search for sources of redundancy in the way the data are produced

    in particular, as incblitz suggested, try to use subtraction: b[i]=a[i]-a[i-1], then compress b[] instead of a[]
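
    A minimal sketch of that delta preprocessing (the subtraction is done modulo 256 so the transform stays invertible; the example values at the end are made up):
    Code:
    # First-order delta filter b[i] = (a[i] - a[i-1]) mod 256, and its inverse.
    def delta_encode(data):
        prev, out = 0, bytearray(len(data))
        for i, x in enumerate(data):
            out[i] = (x - prev) & 0xFF
            prev = x
        return bytes(out)

    def delta_decode(data):
        prev, out = 0, bytearray(len(data))
        for i, d in enumerate(data):
            prev = (prev + d) & 0xFF
            out[i] = prev
        return bytes(out)

    raw = bytes([3, 5, 4, 4, 6, 30, 32, 32, 31])        # made-up example scores
    assert delta_decode(delta_encode(raw)) == raw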

  7. #7
    Member
    Join Date
    May 2008
    Location
    Germany
    Posts
    410
    Thanks
    37
    Thanked 60 Times in 37 Posts
    @Fixee: have you tested the new compressor bsc too?

    bsc 2.8.0: http://libbsc.com/downloads.aspx

    especially test it with the switches -m4, -m5 or -m6

    command lines, for example (inpfile is the name of your data file):

    bsc e inpfile out-st4-b256-cpM128 -m4 -cp -b256 -M128
    bsc e inpfile out-st4-b328-cpM128-t -m4 -cp -b328 -t -M128

    bsc e inpfile out-st5-b256-cpM128 -m5 -cp -b256 -M128
    bsc e inpfile out-st5-b328-cpM128-t -m5 -cp -b328 -t -M128

    bsc e inpfile out-st6-b256-cpM128 -m6 -cp -b256 -M128
    bsc e inpfile out-st6-b328-cpM128-t -m6 -cp -b328 -t -M128


    best regards

    ---
    if you have an nvidia graphics card - bsc 3.0.0 supports GPU via CUDA

    bsc 3.0.0: http://encode.su/attachment.php?atta...0&d=1314085966

  8. #8
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    863
    Thanks
    461
    Thanked 257 Times in 105 Posts
    If you are just working on symbol statistics, you can't beat entropy coding (i.e. Huffman or a Range Coder).

    If you want more compression, you need to find other correlations, such as:
    1) Is value [n+1] somewhat correlated to value [n] (as already suggested)?
    2) Could the trend (derivative) provide a useful hint about the next value (such as [n] ~= [n-1] + ([n-1] - [n-2]))?
    3) Are some ranges of values specifically clustered within some parts of the file (adaptive entropy statistics would help in this case; try huff0 as an example)?
    4) Could the correlation follow a predictable pattern, such as value [n] being roughly equivalent to value [n-12] (you can guess that if you know how your samples are being collected)?

    and so on.

    With just global symbol statistics, there is simply not enough information to beat single-pass entropy coding.
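
    One cheap way to probe suggestions like (1) and (4) without guessing a formula is to measure the conditional entropy of a byte given the byte k positions back, for a few lags k: if H(a[n] | a[n-k]) is clearly below the order-0 entropy for some k, there is exploitable correlation at that lag. A rough Python sketch (run it on a sample chunk; the file name and the set of lags are placeholders):
    Code:
    # Conditional entropy H(a[n] | a[n-k]) for a few lags k vs. order-0 entropy.
    import math
    from collections import Counter

    data = open("scores.bin", "rb").read()              # placeholder file name

    def entropy(counter):
        total = sum(counter.values())
        return -sum(c / total * math.log2(c / total) for c in counter.values())

    print("order-0: %.3f bits/byte" % entropy(Counter(data)))

    for k in (1, 2, 3, 12):                             # lags worth trying
        joint = Counter(zip(data[:-k], data[k:]))       # counts of (a[n-k], a[n])
        cond = entropy(joint) - entropy(Counter(data[:-k]))   # H(X,Y) - H(X)
        print("lag %2d: H(a[n] | a[n-k]) = %.3f bits/byte" % (k, cond))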

  9. #9
    Member
    Join Date
    Jul 2011
    Location
    phils
    Posts
    32
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Sorry I didn't notice it was in hex.

    And forgot that nibbles have 0x0f as upper bound.

    Bulat's right.

    Anyways, what are these "confidence scores"?
    Last edited by incblitz; 25th August 2011 at 16:52.

  10. #10
    Member Alexander Rhatushnyak's Avatar
    Join Date
    Oct 2007
    Location
    Canada
    Posts
    237
    Thanks
    39
    Thanked 92 Times in 48 Posts
    Quote Originally Posted by Fixee View Post
    Plain Huffman coding compresses to 0.40, but I'm trying to do better. I've tried SO many things, without anything satisfying. gzip gets 0.46, bz2 0.42 and lzma -7 0.40. So nothing soundly beats Huffman codes. (The symmetric zpaq -m1 gets 0.39.) None of these leverages my ability to use a static dictionary.
    ...
    Any advice is greatly appreciated!
    If paq8px can't get better than 0.39, then the probability that it's possible is very low... but not zero, see http://encode.su/threads/1100?p=2199...ll=1#post21992

    This newsgroup is dedicated to image compression:
    http://linkedin.com/groups/Image-Compression-3363256

  11. #11
    Member
    Join Date
    Aug 2011
    Location
    California, USA
    Posts
    15
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    usually the best way to find a compressor is to learn the data source. for example, text packers aren't developed by trying every method possible but by learning the properties of natural language

    so you need to search for sources of redundancy in the way the data are produced

    in particular, as incblitz suggested, try to use subtraction: b[i]=a[i]-a[i-1], then compress b[] instead of a[]
    Greetings Bulat!

    I tried the first-order difference that incblitz suggested, both b[i] = a[i] - a[i-1] and b[i] = a[i] ^ a[i-1], but the statistics worsened. (The histogram became more spread out.) I also tried a 2nd-order difference (the differences of the differences) and things degenerated further, so I don't think this is going to work, unfortunately.

    The data I'm trying to compress comes in "runs" of 108 bytes, with millions of runs per file. I thought maybe each run looked like the others, so I tried a strategy that compressed each column independently as well. In other words, assume the file looks like this:

    Code:
    b1,1  b1,2  b1,3  b1,4  ...  b1,108      // 108 bytes per row
    b2,1  b2,2  b2,3  b2,4  ...  b2,108
    .
    .                                        // 100 million rows

    This failed miserably (LZ was much worse, huffman was the same).

    Since I am unable to find patterns by hand, I'm hoping to find or write a tool to do it.
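
    For reference, the column-grouping transform described above can be reproduced like this; a minimal sketch (108-byte records assumed, any trailing partial record dropped, file names are placeholders):
    Code:
    # Transpose a file of fixed 108-byte records: all bytes of column 0 first,
    # then column 1, and so on.
    RECLEN = 108

    data = open("scores.bin", "rb").read()              # placeholder file name
    nrec = len(data) // RECLEN
    data = data[:nrec * RECLEN]                         # drop a partial record

    columns = bytearray(len(data))
    for col in range(RECLEN):
        columns[col * nrec:(col + 1) * nrec] = data[col::RECLEN]

    open("scores.columns", "wb").write(bytes(columns))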

  12. #12
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,507
    Thanks
    742
    Thanked 665 Times in 359 Posts
    LEARN YOUR DATA SOURCE. ask yourself - if i know the results of thousands of previous experiments and know the results of the first 60 points of the current experiment, can this help me to predict the next value better than pure huffman stats?

  13. #13
    Member
    Join Date
    Aug 2011
    Location
    California, USA
    Posts
    15
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Alexander Rhatushnyak View Post
    If paq8px can't get better than 0.39, then the probability that it's possible is very low... but not zero, see http://encode.su/threads/1100?p=2199...ll=1#post21992
    Thanks for the pointer, Alexander.

    I think there is some kind of pattern to my data: zpaq -m4 compressed my 8GB file to 0.36 (in 18 hours). However, I'm not sure how to figure out which of the models (of the 22 used by -m4) are doing the best job; it would be helpful to know that.

  14. #14
    Member
    Join Date
    Aug 2011
    Location
    California, USA
    Posts
    15
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Cyan View Post
    If you are just working on symbol statistics, you can't beat entropy coding (i.e. Huffman or a Range Coder).

    If you want more compression, you need to find other correlations, such as:

    2) Could the trend (derivative) provide a useful hint about the next value (such as [n] ~= [n-1] + ([n-1] - [n-2]))?
    This is an interesting suggestion, and it's possible that there are these kinds of functions embedded in my data. The problem is, how do I find them? I think if I just try arbitrary functions, I will spend a lot of time and probably not succeed.

    What I need is an automated way to look for patterns like this by feeding sample data to a program that tries to find such correlations.

  15. #15
    Member
    Join Date
    Jul 2011
    Location
    phils
    Posts
    32
    Thanks
    0
    Thanked 0 Times in 0 Posts
    To highlight this again: Bulat: LEARN YOUR DATA SOURCE
    Me: Anyways, what are these "confidence scores"?

    If it is known how these data are produced, i.e. how
    the source behaves, it is possible to create a special
    model for this.

    And it will most probably compress better than any
    general purpose algorithm, or combinations of filters etc.

  16. #16
    Member
    Join Date
    Aug 2011
    Location
    California, USA
    Posts
    15
    Thanks
    0
    Thanked 0 Times in 0 Posts
    If I cannot make progress with context modeling and other ways of leveraging AI to help me compress these files, here is what I plan to do:

    Let's say I have 500 files, each is about 10GB in size, and all contain this experimental data. I am willing to keep a 1GB static dictionary. Then the smartest thing to do is find some "common subset" of byte sequences over the whole set of 500 files.

    Of course we have the familiar trade-off: if I keep short sequences then they occur more often (good!), but pointers to them are longer (bad), and each pointer represents fewer bytes (bad).

    But my biggest question is this: how do I even build this dictionary? I have 5TB of data with no internal boundaries. How do I find the most common byte sequences of the "most favorable" lengths?

    Is this problem well-studied?
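
    As a crude starting point, one can count fixed-length substrings over a sampled subset of the data and keep the most frequent ones; proper dictionary builders handle overlapping matches and variable lengths much more cleverly, but this shows the shape of the problem. A hedged sketch (file names, substring length, sampling stride and the top-1000 cutoff are all illustrative only):
    Code:
    # Naive frequent-substring counting as a first cut at a static dictionary.
    from collections import Counter

    SEQLEN = 16            # candidate dictionary-entry length (arbitrary)
    STRIDE = 64            # sample every 64th position to keep memory sane

    counts = Counter()
    with open("scores.bin", "rb") as f:                 # placeholder file name
        data = f.read()
    for i in range(0, len(data) - SEQLEN, STRIDE):
        counts[data[i:i + SEQLEN]] += 1

    # Keep the most frequent candidates as a crude shared dictionary.
    dictionary = b"".join(seq for seq, _ in counts.most_common(1000))
    open("static.dict", "wb").write(dictionary)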

  17. #17
    Member
    Join Date
    Jul 2011
    Location
    phils
    Posts
    32
    Thanks
    0
    Thanked 0 Times in 0 Posts
    How are these data produced? This is the best lead, as opposed to training data.

  18. #18
    Member
    Join Date
    Aug 2011
    Location
    California, USA
    Posts
    15
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by incblitz View Post
    To highlight this again: Bulat: LEARN YOUR DATA SOURCE
    Me: Anyways, what are these "confidence scores"?

    If it is known how these data are produced, i.e. how
    the source behaves, it is possible to create a special
    model for this.

    And it will most probably compress better than any
    general purpose algorithm, or combinations of filters etc.
    I understand what you're saying here.

    The problem is, I've already tried to understand the data as much as I can, but it's beyond my abilities to do it by hand. That's why I want a machine to help me.

    The data are coming from a real-world measurement device, operated by a human technician. There is error introduced by both the device and the human. The bytes are equal to -10*log10(p), where p is the error probability. So p=0.1 gives 10, and p=0.01 gives 20.

    As you can see above, the most common value in my file is 0x20 = 32, meaning the probability of error was about 0.00063 for those measurements.
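
    A tiny sketch of that score/probability relation (the rounding to the nearest integer is an assumption):
    Code:
    # score = -10 * log10(p)  <=>  p = 10 ** (-score / 10)
    import math

    def score_from_p(p):
        return int(round(-10 * math.log10(p)))

    def p_from_score(score):
        return 10 ** (-score / 10)

    print(score_from_p(0.1))                # 10
    print(score_from_p(0.01))               # 20
    print(round(p_from_score(0x20), 5))     # 0x20 = 32 -> ~0.00063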

  19. #19
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    If bytes are independent, then an order 0 compressor like fpaq0, fpaq0p, fpaq0f2, or zpaq -mo0 (see http://mattmahoney.net/dc/text.html#1493 for o0.cfg) would be the best you could do.

  20. #20
    Member
    Join Date
    Jul 2011
    Location
    phils
    Posts
    32
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Seems that way.

    But how is the error introduced by the machine and the human?

  21. #21
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by Matt Mahoney View Post
    If bytes are independent, then an order 0 compressor like fpaq0, fpaq0p, fpaq0f2, or zpaq -mo0 (see http://mattmahoney.net/dc/text.html#1493 for o0.cfg) would be the best you could do.
    With human-generated data there might be non-obvious correlations based on the time of measurement, who performed it, etc.

  22. #22
    Member
    Join Date
    Aug 2011
    Location
    California, USA
    Posts
    15
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Matt Mahoney View Post
    If bytes are independent, then an order 0 compressor like fpaq0, fpaq0p, fpaq0f2, or zpaq -mo0 (see http://mattmahoney.net/dc/text.html#1493 for o0.cfg) would be the best you could do.
    Thanks Matt.

    I don't think the bytes are independent. For example, if I (arbitrarily) slice my sample file into 16-byte runs and then sort these and count duplicates, I get:

    Code:
      57227 ff010000f9000100e1e2e1e0d9e0dfe2
      44268 00000000000000000000000000000000
      44123 ff010000f90003e2e1e2e1e0d9e0e0e2
      11876 ff010000f9000100e1e2e1e0d9e0dfc2
      10976 ff010000f900e3e2e1e2e1e0d9e0dfe2
       8571 ff010000f900e3e2e1e2e1e0d9e0e0e2
       7844 fef6ff010100020000d7e0e0e1dfe1e0
       4385 ff010000f9000100e1e2e1e0d9e0dee2
       3601 fef6ff0101000200e0d7e0e0e1dfe1e0
       3138 ff010000f9000100ff00e0dfd8dfdee1
    (Note that these runs are "normalized" by XORing through by the leftmost most-common value in the run.) I'm pretty sure this is far from what we'd expect to see with independent bytes generated to match the file distribution.
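
    For reference, the slice-and-count experiment is easy to reproduce; a minimal sketch (no XOR normalization here, and the file name is a placeholder):
    Code:
    # Slice the file into 16-byte runs, count duplicates, print the top 10.
    from collections import Counter

    data = open("scores.bin", "rb").read()              # placeholder file name
    runs = Counter(data[i:i + 16] for i in range(0, len(data) - 15, 16))

    for run, n in runs.most_common(10):
        print("%8d %s" % (n, run.hex()))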

    I ran zpaq -mo0 as you suggested.

    Last edited by Fixee; 26th August 2011 at 03:05.

  23. #23
    Member
    Join Date
    Aug 2011
    Location
    California, USA
    Posts
    15
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by m^2 View Post
    With human-generated data there might be non-obvious correlations based on the time of measurement, who performed it, etc.
    ... how many cups of coffee the technician had, how cute the technician across the table is, etc...

    Yup, these are things that aren't available in the file.

    But still there are patterns somehow...

  24. #24
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    You could try fv and see if there are any patterns in the data. You can get it at http://mattmahoney.net/dc/dce.html#Section_2
    Try it on a chunk no bigger than 2 GB.

  25. #25
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,377
    Thanks
    215
    Thanked 1,023 Times in 544 Posts
    Can't you just post a sample of your data? Like a megabyte or something?
    It's clear now that there's some kind of structure inside, but most compressors,
    except for some slow CMs, don't even have any special handling for fixed-size records.

    Anyway, depending on whether there are long string matches, the solution would likely be
    either preprocessor+lzma, or a structure-aware CM.
    But it's hard to just make a list of ideas to try without any context.


    > Does the poor result on LZ, LZMA, BW mean there just isn't enough
    > "pattern-based" redundancy to extract and I'm doomed to live with
    > the 0.40 Shannon bound?

    Yes, though it's still possible that some simple preprocessing might help,
    like reordering the bytes by their offsets in records, or splitting
    the random low bits into a separate stream while capturing patterns in
    the high bits.
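
    A minimal sketch of the second idea, splitting the noisier low bits into their own stream so the high bits can be modelled separately (the 4/4 nibble split and the file names are arbitrary choices for illustration):
    Code:
    # Split each byte into a high-nibble stream and a low-nibble stream.
    data = open("sample.bin", "rb").read()              # placeholder file name

    high = bytes(b >> 4 for b in data)
    low = bytes(b & 0x0F for b in data)

    open("sample.high", "wb").write(high)
    open("sample.low", "wb").write(low)

    # Reconstruction: byte = (high << 4) | low
    assert bytes((h << 4) | v for h, v in zip(high, low)) == data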

    Also you can try playing with lzma options (lc/lp/pb).
    For example, with 16-byte records lp4 pb4 would be a good thing to try.

    > Is there a compression framework out there that lets me train a
    > model with a few hundred GB of data and then produce a static
    > dictionary to help improve compression beyond what Huffman gives?

    Not really... we'd be out of jobs if something like that existed

    But it's usually a good idea to try compressing stuff with paq8px69
    to estimate the redundancy (usually the same or better result can
    be reached with a much faster model too).

    And then you can try running the optimizer in https://sites.google.com/site/toffer..._0.6_100206.7z

  26. #26
    Member
    Join Date
    Aug 2011
    Location
    California, USA
    Posts
    15
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Shelwien View Post
    Can't you just post a sample of your data? Like a megabyte or something?
    Sure thing. Below is a 1MB sample. The record length is 108 bytes (though I cannot see any periodic data that would help you see this!) and this is only one field from a heterogeneous data file (the other fields are easy to compress).
    Attached Files

  27. #27
    Member
    Join Date
    Aug 2011
    Location
    California, USA
    Posts
    15
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Well, after claiming in the previous post that there is no indication that my record length is 108, you can see I was wrong: it's clear in the fv output as a black line:

    [Attached image: fv.gif - fv output of the sample; the 108-byte record length shows up as a black line.]

    Also, I have been saying huffman gets 0.40, but I just re-ran it and I was wrong: it's 0.44. So already several compressors are beating it.
    Last edited by Fixee; 26th August 2011 at 03:06.

  28. #28
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,377
    Thanks
    215
    Thanked 1,023 Times in 544 Posts
    Well, for now it still seems that straight low-order CM is the best here:
    Code:
    1048592 sample.bin
     510099 huffman // via http://encode.su/threads/1183-Huffman-code-generator
     454042 sample.7z
     442539 sample.lzma // plzma.exe c0 sample.bin sample.lzma 25 9999999 273 6 0 0
     434407 sample.plzma // plzma.exe c1 sample.bin sample.plzma 25 9999999 273 8 0 0
    1049758 1.bmp // bin2bmp 108 sample.bin 1.bmp; http://nishi.dreamhosters.com/u/miniutils_v1.rar
    1049974 1r.bmp // + rotate
     437868 1.bmf // BMF 2.01 -S; http://compression.ru/ds/bmf_2_01.rar
     439100 1r.bmf
     439169 sample.ofr // ofr.exe --encode --raw --channelconfig MONO --sampletype UINT8 sample.bin
     402325 1.bmp.paq8px // http://paq8.hys.cz/paq8px_v69.zip
     413073 1r.bmp.paq8px
     395416 sample.bin.paq8px
     483302 1.png // pngout
     434067 qlfc c sample.bin 7; http://nishi.dreamhosters.com/u/bsc240_qlfc_v2.rar
     408663 09g2a\o1rc.exe c sample.bin 5; http://www.ctxmodel.net/files/mix_test/mix_test_v9.rar
     407530 m1.exe c best.txt sample.bin 8
    1. paq8 gain over bwt postcoders is insignificant
    2. 2D modelling isn't of much help
    3. audio modelling isn't helpful either

    So I'd say it should be possible to compress this file to about 400000 with CM at about 10MB/s (symmetric).
    Also need to test lzma with a really large dictionary (like 1G) on the whole set.
    There's clearly some weird structure (e.g. see http://nishi.dreamhosters.com/u/1a.png), so
    it should still be possible to build a better model... but it would likely be slow, and not much better (maybe a 10-20% further gain)

  29. #29
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Here is a ZPAQ model that beats paq8px_v69. I modified mid.cfg by adding model orders 0..3 that include the column position (offset mod 108). These are organized into an order 0 CM (direct context model) and order 1..3 indirect context models, with orders 2 and 3 chained from the CM (this came after various experiments). I also added an SSE after the mixer using order 1 and the column as the context. It takes about 7 seconds on my laptop (2 GHz T3200) with JIT enabled (g++ 4.5.0).

    Code:
    (x.cfg)
    
    comp 4 7 0 0 13 (hh hm ph pm n)
      0 icm 5        (order 0...5 chain)
      1 isse 13 0
      2 isse $1+17 1
      3 isse $1+18 2
      4 isse $1+18 3
      5 isse $1+19 4
      6 match $1+22 $1+24  (order 7)
      7 cm 16 24 (order 0...3 mod 108)
      8 icm $1+17
      9 isse $1+18 7
      10 isse $1+19 7
      11 mix 16 0 11 24 255  (order 1)
      12 sse 20 11 48 255 (order 1 mod 108)
    hcomp
      c++ *c=a b=c a=0 (save in rotating buffer M)
      d= 1 hash *d=a   (orders 1...5 for isse)
      b-- d++ hash *d=a
      b-- d++ hash *d=a
      b-- d++ hash *d=a
      b-- d++ hash *d=a
      b-- d++ hash b-- hash *d=a (order 7 for match)
      d++ a=c a%= 108 a<<= 9 *d=a (order 0...3 mod 108)
      d++ *d=a b=c b-- hash *d=a
      b-- d++ hash *d=a
      b-- d++ hash *d=a
      d++ a=*c a<<= 8 *d=a (order 1 for mix)
      d++ a=c a%= 108 a<<= 8 a+=*c a<<= 5 *d=a (order 1 mod 108 for sse)
      halt
    post
      0
    end
    To compress, save as x.cfg, then: zpaq -mx c sample-x sample.bin

    Code:
      395,056 sample-x.zpaq
      395,381 sample.bin.paq8px
      401,465 sample-m4.zpaq
      406,322 sample-m3.zpaq
      418,100 sample.bsc
      418,745 sample-m2.zpaq
      430,159 sample-o2.zpaq
      430,615 sample-o1.zpaq
      437,323 sample-m1.zpaq
      454,042 sample.7z
      455,370 sample-o0.zpaq
      503,014 sample.bin.gz
    1,048,592 sample.bin
    You can also pass an argument to control memory usage as -mx,1 to double, -mx,-1 to halve, etc. By default it uses 305 MB.

  30. #30
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,377
    Thanks
    215
    Thanked 1,023 Times in 544 Posts
    I made an ASCII converter for the sample.
    The structure seems pretty complicated...

    Code:
    NC57OWRTTUUVSUUVVWSUUVVWSUUVVWSUUVVVSUUWVWSUUWVWSUUWVWRTRWUWFVRTUVLPNWVVPKNWUNPIMTPLAJGMKLFDKUSQMOHMKUIKFFKA
    6HB5TVSUUWVXTVVWVXTVVWWXTVVWWXTVVWWVSVVWWXTUVWWXSRVWWVTUVWVVQSSWWWQPMWVWPRSWVWTQSWVVOQQXVTNPPXSVNPHWSOMJJTRO
    GMI3OQRNRTUUVTUUUUVRUUUWVQTUTSTTGTURSORDTMPLGKCRJNLDD6RQJHLIUTQSDCIFLAJ9QKKGL9MJEAD3JMJ3G4QJEANAMJJAE6PJABFJ
    MO32OWRTTUUVSUUVTWSUUVVWSUUVUVSUUVVVQUUWVTOOUWVWIPPWVVQCNQPOEKMTRNFDHNIIHGFUSQKADBH3GJHQILF4F8LJC888C9B3393B
    839BNWRTTUUVSUUVVWSUUVVWSUUVVVSUUWVVSUUWVWSUUWVVQUVWVWPFVVUTMRPVUSLHQOVREMHWUSN3ITNPAEGLHLB2FIG89968KAGDAB66
    C4GBOWRTTUUVSUUVVWSUUVVWSUUVV5RRRWVVSUUWVLSRUWVWQUVWSWSTUWVWNSUWUXPHRRSRHOKVUUQAGTPULNPVTUNIOQLGEFGXTVOHNVOT
    JHJKOWRTTUUVSUUVTWVSUVVWSUUVVWSUUVVVSUUWVWMUUWVWKUVWUVQTVUUWDQQWVWL9QNMOHEFUSUK7ITNSKONUTTN6MWHGF6HWMUI5LQON
    GEL3OWRTTUUVSUUVUWSUUVTWSUUVVVSUUVVVPUUWVWSUUWVUILVWVTOCNQSOFIFSSPGDMQSOKDHWPVH8ITQ84IDMKAB6F84D388QKNIF86IG
    C5E8OWRTTUUVSUUVVWSUUVUVSUUVVWSUUVUR5RRUTWQRUWTWNRRWUWPCRQURJAOLSSJ9MNNOKK9LSRKDFSNRLINNQLBDDHKKF773762332B4
    4BHAOWRTTUUVSUUVVWSUUVVWSUUVVVSUUVVVSUUVVUSUUWVWIUVWVWPTUUSTJRMVUVPNQQVTMKKNSVNEMTPUMOPQTUNDHWPKHKHWTUM5JMON
    DEI3OWRTTUUVSUUVVWSUUVUWSUUVVWSUUVVVSUUWVWSUUWVWQVRWVWSIPUVTSRQVVWPKSPSOMKFWUUNIMTQUMNPUQHF2CCLCE8EOMNDHDMKA
    C7J5OWRTTUUVSUUVVWSUUVVWSUUVVWSUUVVVSUUWVWQQUVVWKURWSWPISWFRJNMTTPMEJKIMAGHWTKHJMUPTFMMS7OBDDTLMJKLR9RIK8QLC
    H7I3TVSIUVVXTUVWUXTVVWWXTVVWWXSVVWWVSVVWWXTVUWSTQRUWWXPQKWUVOKQWVVIIHXTTKOLTRTJOQUUTJLOWSRL9FKLNIJ9PSOMMLVPJ
    GCLKOWRTTUUVSUUVVWSUUVVWSUUWVWSJUVVVPUUWVWSUUWVWQUVWVHJCSUSRJQPVVWLNQQSGMKFISSHJLUQUMNONTH5CFMNIC8FFAF34HM7J
    C7E4NUOQPUUVSUUVUWSUUVVWQRRTUWSUUVVVSQUWVWSOUWVWKNRWSTOHROPODNFSTPH9KCFH2629ID326EHLA349AL2263GD8333KB3393HH
    8G95TVSUUWVXTVVWVXTVVWWXTVVWWXTVVWWWTVVWWXTVVWWXSVVWWWTUVWVWQUSWWXOORVTWMDOWRUOQSXVVOQQXVTINRWTVMNPXQQNJNQQS
    MMBBOWRTTUUVSUUVVWSUUVVWSUUVVWSUUVVVSUUWVWSUUWVWQRPWSWPHPUUTNRPVVWLKNOPPMKHVUQNDCSNRA8GQQLGAKU6EA93PKFB7LSKH
    H8E3OWRTTUUVSUUVVWSUUVVWSUUVVVSUUVVVSUOWVTSQRWVWQVRWVWPCNURRJMOTUVL6NNSOMEFRTRLBITPOGCJ8N4368EG32338C3433BF9
    BIJ5TVSUUWVXTVVWVXTVVWWXTVVWWXTVVWWWTVVWWWTRVWWWQRUWWWQMPVVTOOIUVVOIRVTVMQOXSTLMNOTPGNNURRMKLKMBIE6GL5FD9KK9
    GHC3OWRTTUUVSUUVVWSUUVVWSUUVVWSUUVVVSUUWVWSUUWVWQVVWVWSIUWVWNSRWVWPNRQVVODFUSRHJMSPUEMKSTPFDGMGICA5OKJ2B3KEA
    7CH3NWRTTUUVSUUVVVSUUVUWSUUVVVSUUVUVSUQWVWQUUVVWNRUWVWPCUQUTERMVVVLHNNNOKIHUSUNELTOTAKOQRQGBKWLMH2CSKAF77IKH
    3CFATVSUUWVXTVVWVXTVVWWXTVVWWXTVVWWWTRVWWXTVVWWXSVVWWWTUVWWWTSSWVWQPRWVWPSRWVVPQSXVTNRQWUVNPPWUVNPPJOGHIGJNK
    GLL3OWRTTUUVSUUVVWSUUVVWSUUVVWSUUVVTSUUWVWQUUVVWSVVWVWRIUWPWMRUVVWJNMRPRMIHWRU8H8TNRLQGUKOHF9U8GA93N8GA8IVKN
    HGH7TVSUUWVXTVVWVXTVVWWXTVVWWXTVVWWWSVVWWWTRVWWXSRVWWWTRSWVVQSQWVWLORVVVPORVTVLQSXUSIOOWKVMNOUTUNMPXTTNNNTRR
    JML3OWRTTUUVSUUVVWSUUVVWSUUVVWSUUVVVSUUWVWSUUWVWQPRWLSPCSVSMA7JROKF5HKIMGDFUSSN2LNMTG9KSKG5CFEPGH5G8KNI3A3KG
    CJJ5TVSUUWVXTVVWVXTVVWWXTVVWWXTVVWWWTVVWWXTVVWWXOVVWUXTUSWWWPSSWUWILMVVUJLPVTVLSSWSVIROUUQNNPWTQMNPXTHEL9QG9
    DFG3QSPSSVUWSUUWTVSUUWSWSRQVVR5QQUTQLQMMMRJLOVRS752HH6CD28IK352DIG283PJHC3JVIPGJIBNM3A78J5B3CFJ6AELA3H847N6C
