
Thread: freqtable compression

  1. #1
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,207
    Thanks
    187
    Thanked 955 Times in 493 Posts

    freqtable compression

    http://nishi.dreamhosters.com/u/freqtable_v1.rar

    blkfrq book1 book1.frq creates a file with a "uint16_t freq[256]" freqtable repeated for each 64k block of the input file.

    frqcomp c book1.frq book1.ari compresses it (see model_freq.inc)

    Current model works like this:
    Code:
    for( i=A,sum=0; i<B; i++ ) sum+=freq[i];
    encode( sum );
    uint h = (B-A)>>1;
    uint sum0 = freq_proc( freq,A+0,A+h, sum,0 );
                freq_proc( freq,A+h,B+0, 0, sum-sum0 );
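
    Filled out, the recursion could look roughly like the sketch below. A simplified re-expression, not the exact freq_proc() interface; enc_val() is a placeholder for the bitwise freq model.
    Code:
    // encode a range's sum only when the parent doesn't already imply it,
    // then split the range in half and recurse; a single symbol needs
    // nothing further, since its frequency equals the range sum
    uint enc_range( uint16_t* freq, uint A, uint B, int sum_known ) {
      uint i, sum = 0;
      for( i=A; i<B; i++ ) sum += freq[i];
      if( !sum_known ) enc_val( sum );
      if( B-A>1 ) {
        uint h = (B-A)>>1;
        enc_range( freq, A+0, A+h, 0 );  // left half: encode its sum
        enc_range( freq, A+h, B+0, 1 );  // right half: sum = parent minus left
      }
      return sum;
    }
    // root call: enc_range( freq, 0, 256, 0 );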


    Even this is better than simple adaptive bytewise order0 coding:
    Code:
    63,086,136 // fp32_rc
    63,001,795 // fp32_rc with model_freq
    62,826,269 // sh_v1x with model_freq
    But there's some further potential:
    Code:
    475,648 1.frq        // 16-bit freqtables, 512 bytes per block
    227,489 2.ari        // frqcomp
    220,489 2.lzma       // lzma.exe e 1 2l -fb273 -mc999 -lc2 -lp1 -pb4 -mt1 
    219,133 2.pa         // 7z.exe a -m0=plzma:a1:fb273:lc2:lp1:pb4 2.pa 1
    191,220 1.paq8px181  // paq8px_v181.exe -7 1 
    191,102 1.paq8pxd67  // paq8pxd_v67_SSE42.exe -s7 1 
    202,600 1.paq8hp12   // paq8hp12.exe c 1 2p
    Any ideas on freqtable model improvement? (the test file 1.frq can be found in the archive above)

  2. #2
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    456
    Thanks
    142
    Thanked 156 Times in 104 Posts
    You could split into 512 byte chunks, delta each uint16 vs the previous uint16 (512 bytes ago), zig-zag(ish) and var-int encode, and then chuck through your favourite compression tool or entropy encoder. The logic being that each 64k block of input is likely to have some similarity to the last. Alternatively build up an average frequency and delta vs that, if we believe the frequencies are static.

    Doing the former and putting through xz gave me 211044, assuming my process was actually reversible and bug free. It harms more complex tools like paq though.

    I did a horrid Perl one-liner hack:

    Code:
    perl -e '$/=undef;for($i=0;$i<256;$i++){$b[$i]=0}@a=unpack("S*",<>);$n=0;while($n < $#a){$i=0;foreach (@a[$n..$n+255]) {$x=$_;$_-=$b[$i];$_=(abs($_)*2)+($_<0);$b[$i++]=$x;do {print chr(($_ & 0x7f)+($_>=128?128:0));$_>>=7;}while($_)};$n+=256}'
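
    In rough C the same transform looks like this (a sketch only, assuming whole 256-entry tables; not a literal translation of the Perl):
    Code:
    #include <stdint.h>
    #include <stdio.h>
    
    // delta each uint16 against the same slot 512 bytes back, zig-zag so
    // negatives become small positives, then emit 7-bit var-int bytes
    static void put_varint( uint32_t v, FILE* out ) {
      while( v>=0x80 ) { putc( (v&0x7F)|0x80, out ); v >>= 7; }
      putc( v, out );
    }
    
    void delta_tables( const uint16_t* freq, int ntables, FILE* out ) {
      uint16_t prev[256] = {0};
      for( int t=0; t<ntables; t++ )
        for( int i=0; i<256; i++ ) {
          int d = (int)freq[t*256+i] - (int)prev[i];
          uint32_t z = d<0 ? 2u*(uint32_t)(-d)+1 : 2u*(uint32_t)d;  // zig-zag
          put_varint( z, out );
          prev[i] = freq[t*256+i];
        }
    }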

  3. #3
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    456
    Thanks
    142
    Thanked 156 Times in 104 Posts
    Alternatively a data rotation by 512. So byte 0, 512, 1024, 1536, ..., followed by byte 1, 513, 1025, 1537, ... etc. Maybe in 16-bit quantities instead of 8-bit too if you wish to keep high and low bytes per value together.

    Rotate by 512 | paq8px_v101 yielded 183994 bytes (179070 on 16-bit quantities) and xz gets it to 192664. That's a trivially fast transform to use.

    E.g. the hacky Perl one-liner again, for data rotation with 8- and 16-bit quantities:

    Code:
    @ seq3a[tmp/freqtabl...]; perl -e '$/=undef;@a=unpack("C*",<>);for ($i=0;$i<929;$i++) {for($j=0;$j<512;$j++) {$b[($j*929)+$i]=$a[($i*512)+$j]}};print pack("C*",@b)' 1.frq > 1.frq.rot1
    @ seq3a[tmp/freqtabl...]; xz < 1.frq.rot1|wc -c
    192664
    
    @ seq3a[tmp/freqtabl...]; perl -e '$/=undef;@a=unpack("S*",<>);for ($i=0;$i<929;$i++) {for($j=0;$j<256;$j++) {$b[($j*929)+$i]=$a[($i*256)+$j]}};print pack("S*",@b)' 1.frq > 1.frq.rot2
    @ seq3a[tmp/freqtabl...]; xz < 1.frq.rot2|wc -c
    199936
    # paq8 is 179070
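
    The same rotation in C terms (a sketch; ntables is 929 for 1.frq, i.e. 475,648 / 512 bytes per table):
    Code:
    #include <stdint.h>
    
    // transpose the tables so that each symbol's frequencies across all
    // blocks end up contiguous
    void rotate16( const uint16_t* in, uint16_t* out, int ntables ) {
      for( int t=0; t<ntables; t++ )
        for( int s=0; s<256; s++ )
          out[ s*ntables + t ] = in[ t*256 + s ];
    }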

  4. The Following User Says Thank You to JamesB For This Useful Post:

    Shelwien (14th August 2019)

  5. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,207
    Thanks
    187
    Thanked 955 Times in 493 Posts
    Can't really rotate anything - I'm trying to make a blockwise adaptive rc to compete with rANS.
    So I'd like each table to be stored in the header of each block.
    But yeah, I guess there's more similarity in freqs of individual symbols than I expected.

    Code:
    47,270,885 0.250s 0.125s // rans_static -o0
    46,715,033 1.203s 2.031s // gcc82/k8
    46,715,033 1.047s 2.016s // gcc82/native
    46,715,033 1.094s 2.234s // clang900x/k8 
    46,715,033 0.750s 1.938s // IC19 AVX2
    46,502,448 0.781s 2.000s // BLKSIZE=1<<14 (was 15)

  6. #5
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    456
    Thanks
    142
    Thanked 156 Times in 104 Posts
    For blockwise use you'd need to evaluate those compression tools by chopping the data into blocks and testing each block individually, but knowing how much you can get without chopping at least hints at a baseline to aim for.

    You may be able to predict the upcoming frequencies based on what's gone before in this block of frequencies. E.g. if we're encoding the freq for symbol 200 and the freqs for symbols 0 to 199 already sum to 65535, then we know there's only 1 symbol left out there. Depending on the cardinality of the symbols, it may also be better to sort by frequency and write the symbol letters out in addition to the delta frequencies, but generally I found that didn't work too well. It may on some data sets, though.
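
    A rough sketch of that running-total constraint (enc_bounded() is a placeholder for coding a value known to lie in [0, max]):
    Code:
    // each frequency can't exceed whatever of the block total is still
    // unaccounted for, so the coder's range shrinks as we go
    void enc_freqs_bounded( const uint16_t* freq, int nsym, unsigned total ) {
      unsigned remaining = total;
      for( int s=0; s<nsym && remaining>0; s++ ) {
        enc_bounded( freq[s], remaining );
        remaining -= freq[s];
      }
      // once remaining hits zero the rest are known to be zero
    }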

    I'm not sure how FSE works, but it's likely Yann will have put some effort into finding a good solution that doesn't cost too much time.

    Edit: or perhaps there is a sweet range, expressible as 2 bytes (e.g. 32 to 122), whose symbols we encode first; then we encode the remainder (0-31 and 123-255) armed with the knowledge that the total remaining frequency is very low. If you can work out which range leaves no unstored frequency >= 256, then you know the others all fit in 1 byte, meaning no var-int encoding is necessary and freqs between 128 and 255 don't take up an additional byte (assuming a byte-based var-int method instead of a bit-based one).
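
    Finding such a range could look roughly like this (untested sketch):
    Code:
    // smallest contiguous [lo,hi] holding every symbol whose frequency needs
    // more than one byte; everything outside fits in a single raw byte
    void find_big_range( const uint16_t* freq, int* lo, int* hi ) {
      int a = 0, b = 255;
      while( a<256 && freq[a]<256 ) a++;
      while( b>=0  && freq[b]<256 ) b--;
      if( a>b ) a = b = 0;               // nothing needs two bytes
      *lo = a; *hi = b;                  // store lo,hi as two header bytes
    }
    // then: var-int code freq[lo..hi], one raw byte each for the rest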

  7. #6
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,207
    Thanks
    187
    Thanked 955 Times in 493 Posts
    > knowing how much you can get when not chopped at least hints at a baseline to aim for.

    It already works better than a similar coder with incremented freqs.
    And overall compression improves when I use smaller blocks.

    > You may be able to predict the upcoming frequencies based on
    > what's gone before in this block of frequencies.

    Well, I actually have a proper bitwise adaptive model for freqs there;
    it already produces results similar to default lzma.
    Just that it doesn't remember individual freqs from the previous block atm.

    > E.g. if we're encoding freq for symbol 200 and the sum of all freqs for symbols 0 to 199
    > add up to 65535 then we know there's only 1 symbol left out there.

    That's probably a rare case.

    > Depending on the cardinality of the symbols, it may also be better to sort by frequency
    > and write the symbol letters out in addition to the delta frequencies,

    That's the idea I had actually - I already sort symbols anyway.
    But it also doesn't take into account similarity of symbol freqs in adjacent blocks.

    > I'm not sure how FSE works, but it's likely Yann will have put some effort
    > into finding a good solution that doesn't cost too much time.

    I think zstd just compresses the stats with its default literal coder, using precomputed state tables.
    Jyrki Alakuijala said that it's not worth the extra complexity when headers take something like 0.2% of the compressed data (0.5% here).
    But I think it's better to compress the headers as much as possible, since it doesn't change the processing speed anyway.

  8. #7
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    679
    Thanks
    214
    Thanked 209 Times in 129 Posts
    While Huffman has canonical coding, optimal encoding of frequencies for an accurate entropy coder is a difficult problem.
    I had a thread about it: https://encode.su/threads/1883-Reduc...y-distribution
    Ideally we would like to optimize minimal description length:

    header cost + redundancy from inaccuracies ~ cost of storing probabilities + length * Kullback-Leibler(p,q)
    Kullback-Leibler(p,q) ~ sum_s (p_s - q_s)^2 / p_s

    So e.g. we should use higher accuracy for longer blocks and for lower-probability symbols.
    Here is an old construction of mine (I haven't worked further on it):
    - encode the CDF as points in the (0,1) range,
    - start by splitting the range into #symbols equal subranges and encode the number of points in each with unary coding - a trick from James Dow Allen which turns out to be nearly optimal (slides 17, 18 here); a rough sketch of this step follows the list,
    - then scan the symbols, adding single bits of information until reaching the chosen precision.
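
    A rough sketch of the second step (put_bit() is a placeholder bit writer; n<=256 assumed):
    Code:
    // the CDF of n symbols gives n-1 interior points in (0,1); split (0,1)
    // into n equal bins and write each bin's point count in unary
    // (count zero-bits followed by a one-bit)
    void enc_cdf_bins( const double* cdf, int n ) {
      int counts[256] = {0};
      for( int s=0; s<n-1; s++ ) {
        int bin = (int)( cdf[s]*n );     // equal-width bin the point falls in
        if( bin>n-1 ) bin = n-1;
        counts[bin]++;
      }
      for( int b=0; b<n; b++ ) {
        for( int k=0; k<counts[b]; k++ ) put_bit(0);
        put_bit(1);                      // terminates this bin's unary count
      }
    }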


  9. The Following User Says Thank You to Jarek For This Useful Post:

    Shelwien (14th August 2019)

  10. #8
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,207
    Thanks
    187
    Thanked 955 Times in 493 Posts
    > Ideally we would like to optimize minimal description length

    With AC/rANS we can use all information from precise symbol counts, so precision trade-offs don't make any sense.

    > start with splitting the range into #symbols equal subranges

    My implementation recursively encodes range sums until it reaches individual symbols,
    but based on paq results (and James' preprocessing) there's a potential for improvement.

  11. #9
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    679
    Thanks
    214
    Thanked 209 Times in 129 Posts
    Storing the exact counts is a simple option, but not necessarily optimal - you might need a smaller total number of bits if you store approximated frequencies instead.

    The splitting into subranges is just the first step - the top of the diagram; below that, successive bits are added until the chosen accuracy is reached - higher for low-probability symbols.

  12. #10
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    456
    Thanks
    142
    Thanked 156 Times in 104 Posts
    Quote Originally Posted by Shelwien View Post
    > Depending on the cardinality of the symbols, it may also be better to sort by frequency
    > and write the symbol letters out in addition to the delta frequencies,

    That's the idea I had actually - I already sort symbols anyway.
    But it also doesn't take into account similarity of symbol freqs in adjacent blocks.
    If you're encoding block by block but don't need synchronisation points for random access, then yes, using the previous freqs helps.
    It's possible you may want to just exploit the previous symbol order, though. One idea: for text, the first block stores ETAONRISHD... (the letters sorted by frequency) along with their frequencies, each as a delta to the previous entry. The next block doesn't store the symbol order, but does store the frequencies in that order, as deltas (which may now be negative), and then sorts to compute the actual frequency order - e.g. maybe I is now more common and has moved up a couple of places: ETAOINRSHD. In the second block the delta for I vs R would be negative, but zig-zag encoding can cope with that. I don't know if this helps much, but I'd guess the order of symbol frequencies varies less than the actual frequencies themselves. If so, perhaps we can use this to reduce the complexity in the frequency values.
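
    Roughly (a sketch; helper names hypothetical, put_varint() being a byte-based var-int writer):
    Code:
    // keep the previous block's symbol order; store the current block's freqs
    // in that order as zig-zag deltas vs the previous entry in the list, so
    // near-sorted data gives small values even when the order shifts slightly
    void enc_freqs_in_prev_order( const uint16_t* freq, const uint8_t* prev_order, int nsym ) {
      int prev = 0;
      for( int i=0; i<nsym; i++ ) {
        int f = freq[ prev_order[i] ];   // i-th symbol of the old order
        int d = f - prev;                // negative if this symbol moved up
        put_varint( d<0 ? 2u*(unsigned)(-d)+1 : 2u*(unsigned)d );  // zig-zag
        prev = f;
      }
      // the decoder rebuilds the freqs, then re-sorts to get this block's order
    }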

  13. #11
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    456
    Thanks
    142
    Thanked 156 Times in 104 Posts
    Quote Originally Posted by Jarek View Post
    Storing the exact counts is a simple option, but not necessarily optimal - you might need a smaller total number of bits if you store approximated frequencies instead.

    The splitting into subranges is just the first step - the top of the diagram; below that, successive bits are added until the chosen accuracy is reached - higher for low-probability symbols.
    One thing I wrestled with was how to store frequencies for both large and small blocks. For large blocks we'll be normalising the frequencies to, say, 14 bits anyway. For small blocks we don't want to normalise upwards, as that invents extra precision that we then have to store. We could store the actual or normalised counts (whichever is smaller) and always renormalise on decode, but you have to be very careful when renormalising in the decoder to ensure the maths is exact with no potential rounding differences. Instead I normalise to the maximum bit count (say 14), or to the next highest bit count if the block is small. Then the decode process can either use a smaller lookup table or, for my use in order-1 tables where everything must share the same normalised total, scale up using a simple bit shift.
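
    A sketch of that choice of target (paraphrased, not the actual code):
    Code:
    // normalise to max_bits (e.g. 14), or to the next power of two above the
    // block length if that's smaller, so small blocks aren't inflated; the
    // decoder left-shifts by (max_bits - norm_bits) when it needs one scale
    int pick_norm_bits( int block_len, int max_bits ) {
      int bits = 1;
      while( (1<<bits) < block_len && bits < max_bits ) bits++;
      return bits;
    }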

    Storing order-1 frequencies, i.e. a 2D array, poses its own problems though. I wanted completely random access so couldn't use previous blocks, but I can use the order-0 frequencies to aid the order-1 storage. I store a list of all the order-0 symbols followed by the order-1 tables for those symbols only, including all the zero-frequency entries. For those, I also have a run length, because many order-1 tables have large runs of zero-frequency values (for symbols that may occur in another context). Example code here: https://github.com/jkbonfield/htscod...ns4x16.js#L919
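
    The zero-run part boils down to something like this (a sketch, not the linked code; put_varint() is a placeholder var-int writer):
    Code:
    // each frequency is var-int coded; a zero is followed by a run length
    // saying how many further entries are also zero
    void enc_freqs_rle( const uint16_t* f, int nsym ) {
      for( int i=0; i<nsym; i++ ) {
        put_varint( f[i] );
        if( f[i]==0 ) {
          int run = 0;
          while( i+1+run<nsym && f[i+1+run]==0 ) run++;
          put_varint( run );
          i += run;                      // skipped entries are implicitly zero
        }
      }
    }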

  14. #12
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    679
    Thanks
    214
    Thanked 209 Times in 129 Posts
    Especially for small blocks, it is worth considering heavily approximated frequencies, as in the representation from the diagram above: with accuracies optimizing header cost + length * Kullback-Leibler.
    Canonical Huffman can be seen as an example of such an approximation: it represents only probabilities of the form 1/2^k.

    It might also be worth considering parametric distributions, e.g. Golomb codes for a Laplace distribution, storing/predicting only their parameters ( https://encode.su/threads/3124-LOCO-...ny-alternatves ).
