
Thread: Finite State Entropy

  1. #151
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    892
    Thanks
    492
    Thanked 280 Times in 120 Posts
    There is a new feature available on the beta branch of FSE : streaming API.
    https://github.com/Cyan4973/FiniteSt...ropy/tree/beta

    This API operates at the bitStream level, and allows mixing multiple bitStream sources,
    including, but not limited to, multiple FSE coders/decoders.

    The API is in fact already used within Zhuff,
    but it's the first time I've taken the time to properly expose and document it.

    As an example, the API is used within FSEDist, a simple length descriptor, mixing FSE-compressed prefixes with raw bitFields.

    As usual, this is still considered a raw first release, so it still requires some testing.
    Comments on the API itself (its usability or functionality, for example) are also welcome.


    Regards

  2. Thanks (2):

    Bulat Ziganshin (6th February 2014),samsat1024 (6th February 2014)

  3. #152
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    892
    Thanks
    492
    Thanked 280 Times in 120 Posts
    For information

    Latest version of FSE at https://github.com/Cyan4973/FiniteStateEntropy
    now includes error detection at each phase,
    and introduces a safer variant of the decoder, designed to resist invalid input data.

    Among the test programs, there is also a new fuzzer tool, which tries to force errors in the compression & decompression functions.
    Since the latest update, all sorts of weird corner cases have been detected and corrected, so it's becoming less obvious how to make it fail.
    Nonetheless, it's still early days, so please handle this release with care.
    It's not production-validated yet, but that's the general direction...

  4. Thanks:

    Bulat Ziganshin (9th February 2014)

  5. #153
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by Cyan View Post
    ...so it starts to become less obvious to make it fail.

  6. #154
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    I made a fork of FSE that supports playing with custom symbol arrangement algorithms: https://github.com/neal-burns/FiniteStateEntropy

    There's a trivial example in test/custom_spread.c that can be used as a starting point. I haven't tested it as thoroughly as I'd like to, but it seems to work for me.

    The input to the function is the symbols in sorted order; it should copy them into output in permuted order.

  7. #155
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    819
    Thanks
    253
    Thanked 264 Times in 163 Posts
    nburns, if you want to do it right:
    - allow choosing among a few common distributions: binomial, Gaussian, (two-sided) exponential ...
    - take a quantizer to get p[s]~f[s]/L, as in Charles Bloom's very good article: http://cbloomrants.blogspot.com/2014...ng-ans-10.html
    - allow the spread function to use not only f[s] but also the original p[s],
    - finally, find deltaH through simulations (not eigenvectors): generate a long sequence from the probability distribution and subtract the optimal entropy for this sequence: sum_s count[s]*lg(1/p[s]), where count[s] counts appearances in the sequence.

  8. #156
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    With respect to distributions, are you talking about creating synthetic input data?

    The quantizer is Yann's. I decided to leave that alone, because symbol arrangement seems like a more interesting and current topic, and the quantizer looks sophisticated already.

    The spread function uses only the information stored in the block header so that it's decodable.

    I'd like to look at generating interesting statistics at some point.

    The goal of this is partly to help answer the general question of ideal tANS performance, and partly to judge how much FSE could improve by altering the spread function. I think both are interesting and relevant.
    Last edited by nburns; 20th February 2014 at 11:47.

  9. #157
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    819
    Thanks
    253
    Thanked 264 Times in 163 Posts
    As the efficiency of a spread function depends on the probability distribution, as a theoretician I would first use different fixed distributions to understand the strengths and weaknesses of a given method.

    If the decoder knows only the quantized values (f[s]), you will probably not get much better than precise initialization - the question here is only how to make it faster.
    A much more interesting question is how to improve the use of the real probabilities (p[s]) - which is also the question of what information to store in the header - how to quantize probabilities.
    The inaccuracy of the least probable symbols has the largest impact on deltaH ~ sum_s (p_s-q_s)^2 / p_s.
    So while there is a huge difference between f[s]=1 and f[s]=2, quantization to f[s]=1024 could cover a large interval of probabilities with nearly no penalty - we can use sparser quantization for more probable symbols.

    And so we should improve quantization accuracy for low-probability symbols. For example, besides the f[s]=1 single-appearance count, I advise using an additional f[s]=1/2 count value for extremely low-probability symbols (corresponding to the lowest p[s]~1/(L*sqrt(2))), which means there is a single appearance and it should be placed at the end of the table (before starting the proper spread function).

    ps. I see Yann's recent implementation has also gained this nearly 2x speedup: from 280 MS/s decoding to 520 MS/s (Zhuff is probably next).
    Last edited by Jarek; 20th February 2014 at 16:17.

  10. #158
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    892
    Thanks
    492
    Thanked 280 Times in 120 Posts
    FSE has been updated, merging all improvements accumulated so far into the "beta" branch.
    I was expecting to take more time to polish it, but since equivalent ideas are bursting around on the net, and with limited time available, let's push this version as is and continue polishing it over the next few weeks.

    3 major points :

    1) There is an experimental "low mem" mode.
    It's not yet fully complete, and still deserves quite some polishing.
    But at least it works correctly.

    The intention is to reduce the size of tables required to decode a stream of literals, to around 3 or 4 KB.
    Other symbol types, using fewer than 256 values, should be able to work with even less memory.
    There is however some work still to be completed, especially on header size optimization, and maybe on value distribution.
    That's why this is still considered an "experimental" mode, with quite some improvements to come.

    It's possible that nburns' custom symbol arrangement fork could be relevant for this mode.
    I've had no time to look at it yet, but expect to take an interesting look into it within the next few days.


    2) Speed is noticeably improved

    Code changes are surprisingly limited though; most notably, loops now mix multiple state changes.
    This was done to answer the critical request to merge several streams using different tables into a single bitStream.
    Incidentally, it also resulted in this speed boost.
    It's by the way similar to the speed trick proposed and described by Charles Bloom on his blog, which was published at about the same time,
    and one can also find similarities with xxHash's strategy of using multiple parallel accumulators.


    3) Number Presentation

    I'm aware that some performance figures have been spreading around;
    a less known fact is that some of these figures are limited to the "hot loop" timings,
    disregarding the effort required to collect stats, build tables, manage read/write headers, and so on
    (incidentally not implementing, or delivering only limited/non-optimized, functions devoted to this scope).

    This logically results in much higher numbers,
    and since few people actually look into the code to notice such differences in scope,
    it also results in many people believing and spreading on the net that "this particular version is better/faster".

    Unfortunately, I can't expect everybody to pay that much attention to details. Basically, "bigger numbers sell".

    So the only logical answer I could find is to align communication on this new, more favorable metric.
    I feel it's less representative of "real world" performance, but well, at least it allows equal-terms comparison.

    Starting with this version, the "default mode" benchmarks hot-loop timing only.
    The "full mode" is still available, using the documented '-b' command,
    and detailed benchmark figures using the more complete '-b' setup are published in the data directory,
    comparing FSE with zlib's Huffman entropy coder (thanks to caveman's work).

    Note also that "comparing with zlib's Huffman" is not the same as "comparing with Huffman";
    faster Huffman versions may exist, it's just that we don't know of any (yet).

  11. Thanks (3):

    Bulat Ziganshin (20th February 2014),JamesB (20th February 2014),Jarek (20th February 2014)

  12. #159
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    528
    Thanks
    204
    Thanked 187 Times in 128 Posts
    You need to be careful to use the correct zlib mode, as otherwise it adds gzip headers and a CRC32 over all of the data.

    I have a tiny little zlib_test.c program to control zlib strategies on the command line, which I find useful. (It's surprising how often Z_FILTERED beats Z_DEFAULT_STRATEGY, and sometimes even Z_RLE is the winner on size too.)

    I do have a zlib-compatible huffman encoder which is considerably faster at encoding, but slower at decoding. (So a marriage of the two would make sense.) Some benchmarks, with 2 repeated runs to show reproducibility. There's some variance, but not huge.


    Code:
    jkb@gen1a[work/compress...] time taskset 0x400 ./zlib_test Z_HUFFMAN_ONLY < ~/scratch/data/enwik9 > /tmp/a
    dcurr=1000000000
    Strat=2 level=-1
    
    real    0m20.247s
    user    0m19.030s
    sys     0m1.180s
    
    jkb@gen1a[work/compress...] time taskset 0x400 ./zlib_test Z_HUFFMAN_ONLY < ~/scratch/data/enwik9 > /tmp/a
    dcurr=1000000000
    Strat=2 level=-1
    
    real    0m20.452s
    user    0m19.150s
    sys     0m1.130s
    
    jkb@gen1a[work/compress...] time taskset 0x400 ./deflate_interlaced <  ~/scratch/data/enwik9 > /tmp/A
    Input 1000000000 bytes
    Output 639672981 bytes
    
    real    0m10.374s
    user    0m8.530s
    sys     0m1.820s
    
    jkb@gen1a[work/compress...] time taskset 0x400 ./deflate_interlaced <  ~/scratch/data/enwik9 > /tmp/A
    Input 1000000000 bytes
    Input 1000000000 bytes
    Output 639672981 bytes
    
    real    0m10.203s
    user    0m8.590s
    sys     0m1.450s
    
    jkb@gen1a[work/compress...] ls -l /tmp/[aA]
    -rw-r--r-- 1 jkb team117 639672981 Feb 20 13:56 /tmp/A
    -rw-r--r-- 1 jkb team117 638332797 Feb 20 13:55 /tmp/a
    So my encoder is approximately double the speed of the zlib one on this data set; rather surprising given it's not excessively optimised. Decoding is the other way around. Note I did two trials for each, with the output from each other, to demonstrate they are interchangeable.

    Code:
    jkb@gen1a[work/compress...] time taskset 0x400 ./zlib_test -d < /tmp/a > /tmp/b
    dcurr=638332797
    real    0m7.151s
    user    0m5.140s
    sys     0m1.380s
    
    jkb@gen1a[work/compress...] time taskset 0x400 ./zlib_test -d < /tmp/A > /tmp/B
    dcurr=639672981
    real    0m7.240s
    user    0m5.100s
    sys     0m1.540s
    
    jkb@gen1a[work/compress...] md5sum /tmp/[bB]
    e206c3450ac99950df65bf70ef61a12d  /tmp/B
    e206c3450ac99950df65bf70ef61a12d  /tmp/b
    
    
    jkb@gen1a[work/compress...] time taskset 0x400 ./deflate_interlaced -d <  /tmp/a > /tmp/b
    real    0m13.297s
    user    0m10.880s
    sys     0m2.400s
    
    jkb@gen1a[work/compress...] time taskset 0x400 ./deflate_interlaced -d <  /tmp/A > /tmp/B
    real    0m12.370s
    user    0m10.600s
    sys     0m1.750s
    
    jkb@gen1a[work/compress...] md5sum /tmp/[bB]
    e206c3450ac99950df65bf70ef61a12d  /tmp/B
    e206c3450ac99950df65bf70ef61a12d  /tmp/b
    So I'm around 70% slower at decoding it seems.

    Note on this particular machine FSE is still faster though:

    Code:
    jkb@gen1a[work/compress...] /bin/rm /tmp/a; time taskset 0x400 ./FiniteStateEntropy/test/fse ~/scratch/data/enwik9 -o /tmp/a; ls -l /tmp/a
    FSE : Finite State Entropy, 64-bits demo by Yann Collet (Feb 20 2014)
    Compressed 1000000000 bytes into 639409674 bytes ==> 63.94%
    real    0m5.751s
    user    0m5.200s
    sys     0m0.540s
    -rw-r--r-- 1 jkb team117 639409674 Feb 20 14:03 /tmp/a
    
    jkb@gen1a[work/compress...] /bin/rm /tmp/b; time taskset 0x400 ./FiniteStateEntropy/test/fse -d /tmp/a -o /tmp/b
    FSE : Finite State Entropy, 64-bits demo by Yann Collet (Feb 20 2014)
    Decoded 1000000000 bytes
    
    real    0m4.641s
    user    0m3.960s
    sys     0m0.680s

  13. #160
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    528
    Thanks
    204
    Thanked 187 Times in 128 Posts
    And for completeness, my "full coder" benchmark of the ANS vs arithmetic coders:

    Code:
    jkb@gen1a[work/compress...] time taskset 0x400 ./rANS_static -o0 < ~/scratch/data/enwik9 > /tmp/a
    Took 7269308 microseconds, 137.6 MB/s
    
    real    0m7.559s
    user    0m6.760s
    sys     0m0.780s
    
    jkb@gen1a[work/compress...] time taskset 0x400 ./rANS_static -o0 < ~/scratch/data/enwik9 > /tmp/a
    Took 7080561 microseconds, 141.2 MB/s
    
    real    0m7.499s
    user    0m6.550s
    sys     0m0.800s
    
    jkb@gen1a[work/compress...] time taskset 0x400 ./rANS_static -d < /tmp/a > /tmp/b
    Took 5164506 microseconds, 193.6 MB/s
    
    real    0m5.171s
    user    0m4.440s
    sys     0m0.720s
    jkb@gen1a[work/compress...] time taskset 0x400 ./rANS_static -d < /tmp/a > /tmp/b
    Took 4945168 microseconds, 202.2 MB/s
    
    real    0m5.972s
    user    0m4.460s
    sys     0m0.980s
    Code:
    jkb@gen1a[work/compress...] time taskset 0x400 ./arith_static -o0 < ~/scratch/data/enwik9 > /tmp/a
    Took 6772369 microseconds, 147.7 MB/s
    
    real    0m7.142s
    user    0m6.300s
    sys     0m0.790s
    
    jkb@gen1a[work/compress...] time taskset 0x400 ./arith_static -o0 < ~/scratch/data/enwik9 > /tmp/a
    Took 6771719 microseconds, 147.7 MB/s
    
    real    0m7.240s
    user    0m6.200s
    sys     0m0.870s
    
    jkb@gen1a[work/compress...] time taskset 0x400 ./arith_static -d < /tmp/a > /tmp/b
    Took 9643226 microseconds, 103.7 MB/s
    
    real    0m10.745s
    user    0m9.030s
    sys     0m1.130s
    
    jkb@gen1a[work/compress...] time taskset 0x400 ./arith_static -d < /tmp/a > /tmp/b
    Took 9759582 microseconds, 102.5 MB/s
    
    real    0m11.197s
    user    0m9.110s
    sys     0m1.200s
    These are whole-file conversions, not just memory-to-memory benchmarks that exclude frequency table generation and encoding/decoding overheads, so they are more directly comparable to the FSE and huffman tests above. Fabian has sped things up more since I did my earlier wrapping-up of his code, so things will change a bit (e.g. the 64-bit version).

  14. #161
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    892
    Thanks
    492
    Thanked 280 Times in 120 Posts
    Quote Originally Posted by JamesB View Post
    I do have a zlib compatible huffman encoder which is considerably faster at encoding,
    You're welcome to propose a faster huffman codec alternative.
    The intention of the benchmark program is to compare all these alternatives.

    Note that the version proposed by caveman is quite fast already, at around 250 MB/s (buffer to buffer, no disk I/O involved).
    It should complete enwik9 in about 5 seconds, not 20.
    Maybe the I/O part has a significant impact on the metric.


    Also, as an unrelated note :
    I'm curious to know what could be the result of an o1 coder using FSE bitStream.
    It seems to me that it's now possible, using bitStream methods :

    Code:
    void* FSE_initCompressionStream(void** op, ptrdiff_t* state, const void** symbolTT, const void** stateTable, const void* CTable);
    void FSE_encodeByte(ptrdiff_t* state, bitContainer_forward_t* bitC, unsigned char symbol, const void* CTable1, const void* CTable2);
    static void FSE_addBits(bitContainer_forward_t* bitC, size_t value, int nbBits);
    static void FSE_flushBits(void** outPtr, bitContainer_forward_t* bitC);
    int FSE_closeCompressionStream(void* outPtr, bitContainer_forward_t* bitC, int nbStates, ptrdiff_t state1, ptrdiff_t state2, ptrdiff_t state3, ptrdiff_t state4, void* compressionStreamDescriptor, const void* CTable);

  15. #162
    Member caveman's Avatar
    Join Date
    Jul 2009
    Location
    Strasbourg, France
    Posts
    190
    Thanks
    8
    Thanked 64 Times in 33 Posts
    Quote Originally Posted by JamesB View Post
    You need to be careful to use the correct zlib mode as otherwise it adds gzip headers and CRC32 all of the data.
    The code I handed over to Yann is based on Zlib, but it doesn't go through the Zlib API: it's pure Huffman à la Zlib (I've removed nearly everything related to the LZ part, even in the header, since it saves 11 or 12 bits), with no extra gzip header or checksum. There's probably some room left for a slight speed increase in the decoding part. Unfortunately I don't have much time to play with it these days.

  16. #163
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    528
    Thanks
    204
    Thanked 187 Times in 128 Posts
    Quote Originally Posted by caveman View Post
    The code I handed over to Yann is based on Zlib, but it doesn't go through the Zlib API: it's pure Huffman à la Zlib (I've removed nearly everything related to the LZ part, even in the header, since it saves 11 or 12 bits), with no extra gzip header or checksum. There's probably some room left for a slight speed increase in the decoding part. Unfortunately I don't have much time to play with it these days.
    That explains some of the differences I see then. No doubt my machine isn't as fast as the one you tested on (as FSE also differs in speed), but I was wondering why your zlib seemed close to FSE in encoding performance. No doubt there are faster non-zlib-compatible huffman coders out there too. (I'm curious why it's not benchmarked against Huff0.)

    My own zlib rewrite is noddy. It also has no LZ part. I wrote it because I needed a way to break into the stream between the symbol code-length table and the bit stream, as I was using one set of symbol frequencies coupled to many separate "bodies" of compressed bits. Zlib doesn't offer such flexibility. I also extended it with an interleaved coder so that 16-bit numerical data (for example) could be cheatily encoded with the high bytes under one code table and the low bytes under another, but streamed together[1]. This isn't as good as full 16-bit huffman, but it's almost as good for the specific task I had and was still very quick. The code is at https://sourceforge.net/p/staden/cod...e_interlaced.c

    It's pretty simplistic to be honest, but it got the job done. I only posted it here because, despite being noddy, it is still faster at encoding than zlib, so it offers a different view. No doubt there are many further improvements that could be made.

    [1] perl -e 'for ($i=0;$i<1000000;$i++) {$a=int(rand()*1000+10000);print chr($a/256),chr($a%256)}' > a
    jkb@gen1a[work/compress...] ./deflate_interlaced -r2 < a > /dev/null
    Input 2000000 bytes
    Output 1283010 bytes
    jkb@gen1a[work/compress...] ./deflate_interlaced -r1 < a > /dev/null
    Input 2000000 bytes
    Output 1487759 bytes

    (Theoretical is 1000000*l(1000)/l(2)/8 == 1245723 bytes)

  17. Thanks:

    caveman (20th February 2014)

  18. #164
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    892
    Thanks
    492
    Thanked 280 Times in 120 Posts
    Quote Originally Posted by JamesB View Post
    I'm curious why it's not benchmarked against Huff0.
    - Huff0 is not open source
    - Caveman's version of zlib's huffman is faster than Huff0

  19. #165
    Member caveman's Avatar
    Join Date
    Jul 2009
    Location
    Strasbourg, France
    Posts
    190
    Thanks
    8
    Thanked 64 Times in 33 Posts
    Thanks for sharing the code. I took a peek at it; obviously it has been written to be a Huffman-only encoder right from the start, whereas Zlib expects Huffman codes and raw bits to be written in the stream. I think I could somewhat speed up the way variable-length codes are written during the encoding phase based on your work, but that would move the code base away from Zlib, in the sense that this improvement could hardly be backported to the original Zlib code... Will try it (no promise).

  20. #166
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by Jarek View Post
    As the efficiency of a spread function depends on the probability distribution, as a theoretician I would first use different fixed distributions to understand the strengths and weaknesses of a given method.
    I'm not sure how to generate permutations based on those distributions. Could you explain, please?

    Even before doing those kinds of tests, I'd like to get an idea of the magnitude of the difference between worst case and best case. So I'd like to look for the worst case.

    If the decoder knows only quantized values (f[s]), you will probably not get much better than precise initialization - the question here is only how to make it faster.
    My change will help to judge performance vs. compression efficiency of different algorithms, also.

    A much more interesting question is how to improve the use of the real probabilities (p[s]) - which is also the question of what information to store in the header - how to quantize probabilities.
    The inaccuracy of the least probable symbols has the largest impact on deltaH ~ sum_s (p_s-q_s)^2 / p_s.
    So while there is a huge difference between f[s]=1 and f[s]=2, quantization to f[s]=1024 could cover a large interval of probabilities with nearly no penalty - we can use sparser quantization for more probable symbols.
    I'm not quite sure how that might work. Does your proposal only affect the header, or does it also directly change the table?

    An alternative is substituting escapes for the rare symbols, as Charles Bloom recommended.

    Edit: I think I see what you're saying about quantization. E.g., you might sacrifice the low bits of the most common symbols to maximize space usage in the header, and possibly give extra precision to the rare symbols to get better compression. Floating point?

    And so we should improve quantization accuracy for low-probability symbols. For example, besides the f[s]=1 single-appearance count, I advise using an additional f[s]=1/2 count value for extremely low-probability symbols (corresponding to the lowest p[s]~1/(L*sqrt(2))), which means there is a single appearance and it should be placed at the end of the table (before starting the proper spread function).
    Spreading and quantizing are probably so intertwined that they could merge. Except that giving the spreader access to the full probabilities means more data in the header. The testing I've been doing shows that the symbol-stream error is already pretty small, and the header size is non-negligible.
    Last edited by nburns; 21st February 2014 at 06:14.

  21. #167
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Yann --
    Something that would improve the code is inserting comments that explain which algorithm you're using, particularly when it's not obvious and/or a choice had to be made, and providing references where appropriate.

    For instance, if the symbol spreading function is based on one of Jarek's papers, mention the paper and which algorithm you're using from that paper. You could even reference a discussion on this site.

    It's probably normally not worth the effort, but this is cutting edge and it would really be useful.

  22. Thanks:

    Bulat Ziganshin (21st February 2014)

  23. #168
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    892
    Thanks
    492
    Thanked 280 Times in 120 Posts
    OK, I can try to improve the amount of comments in the code.

    Regarding the spreading function, it's not based on Jarek's paper.
    Jarek is more interested in accurate distribution strategies.
    The version currently delivered within FSE has been designed primarily for speed and simplicity.
    There's a blog entry on that : http://fastcompression.blogspot.fr/2...ol-values.html

    Let's say I found its spreading capability "good enough", certainly not "very accurate".
    The objective is to be fast and simple, with an eye on simple CPUs with limited resources and capabilities.

  24. #169
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by Cyan View Post
    OK, I can try to improve the amount of comments in the code.

    Regarding the spreading function, it's not based on Jarek's paper.
    Jarek is more interested in accurate distribution strategies.
    The version currently delivered within FSE has been designed primarily for speed and simplicity.
    There's a blog entry on that : http://fastcompression.blogspot.fr/2...ol-values.html

    Let's say I found its spreading capability "good enough", certainly not "very accurate".
    The objective is to be fast and simple, with an eye on simple CPUs with limited resources and capabilities.
    Before I saw what FSE does, I was envisioning trying the inverse of that spread function: visit table cells sequentially and take symbols at fixed intervals from a sorted array (modulo table length, relatively prime interval).

    For code comments, just give some sort of reference so somebody could look it up. Nothing lengthy. It's like a bibliography.

  25. #170
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    892
    Thanks
    492
    Thanked 280 Times in 120 Posts
    Before I saw what FSE does, I was envisioning trying the inverse of that spread function: visit table cells sequentially and take symbols at fixed intervals from a sorted array (modulo table length, relatively prime interval).
    Yep, I was hoping to do that too.
    Unfortunately, I haven't found a way to do it.
    The inverse of a modulo-prime step is another modulo-prime step, and it doesn't preserve the sorted order.
    If there is a way to do it, I'm sure it's going to be very interesting

  26. #171
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by Cyan View Post
    Yep, I was hoping to do that too.
    Unfortunately, I haven't found a way to do it.
    The inverse of a modulo-prime step is another modulo-prime step, and it doesn't preserve the sorted order.
    If there is a way to do it, I'm sure it's going to be very interesting
    Maybe your idea is more sophisticated than what I was thinking, because I'm pretty sure my idea would permute the symbols just fine, but I don't have much of a case for why it would be particularly good for compression:


    Code:
    j=0;
    for(i=0; i<1024; i++) {
      print j;
      j += 3;
      if(j>1024) j-=1024;
    }
    // table size = 1024
    // interval = 3
    I tested that in awk, and it hits every number between 0-1023.

  27. #172
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    892
    Thanks
    492
    Thanked 280 Times in 120 Posts
    It's about the same.
    Since 3 and 1024 are relatively prime, a single scan is enough to reach all states once, and only once.

    The issue at stake is to "preserve the order". So let me give an example.

    If you have a symbol with a probability of 100/1024, and decide to position it at the beginning of your scale,
    the resulting symbol slots will be spread from 0 to 297 (0, 3, 6, ..., 297). The sorting is preserved.

    Now, if for some reason your symbol is positioned at 300, and therefore occupies slots 300 to 399,
    then there is an issue: 341x3 = 1023, but 342x3 = 2 (modulo 1024).
    So the simple list from 300 to 399 no longer preserves the order.
    It should be 342-399, then 300-341.

    And that's a simple example, because your sample step is 3.
    But for better spreading capabilities, I would recommend another value, such as 643.
    In such a case, the order is much more difficult to track.

  28. #173
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by Cyan View Post
    It's about the same.
    Since 3 and 1024 are relatively prime, a single scan is enough to reach all states once, and only once.

    The issue at stake is to "preserve the order". So let me give an example.

    If you have a symbol with a probability of 100/1024, and decide to position it at the beginning of your scale,
    the resulting symbol slots will be spread from 0 to 297 (0, 3, 6, ..., 297). The sorting is preserved.

    Now, if for some reason your symbol is positioned at 300, and therefore occupies slots 300 to 399,
    then there is an issue: 341x3 = 1023, but 342x3 = 2 (modulo 1024).
    So the simple list from 300 to 399 no longer preserves the order.
    It should be 342-399, then 300-341.

    And that's a simple example, because your sample step is 3.
    But, for better spreading capabilities, I would recommend another value, such as 643.
    In such case, the order is much more difficult to track.
    I picked 3 for the example for no reason other than to simplify the example.

    On the other hand, I'm not sure how to choose the best step size for that algorithm.

    The symbol order, at least in my broken-out custom spread function, does not need to be preserved. In fact, the output is just permuted symbols, so equal symbols are unordered.

    Edit:
    My reason for thinking that spread function might do OK was that it would make a sequence of passes over the symbols, taking a sparse sample on each pass. The step size would be a parameter. I knew it would sometimes screw up the boundaries between symbols. But it would be fast and simple.
    Last edited by nburns; 22nd February 2014 at 05:14.

  29. #174
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    My LZ compressor returns 4 buffers: literal lengths (LL), literals (LIT), match lengths (ML), and offsets (OFF). I'm interested in very fast decompression using ANS/FSE and I see two possibilities:

    1. I can compress the whole buffers using FSE_compress and FSE_compressU16, but this requires decoding all four buffers with FSE_decompress before the actual LZ decompression
    2. I can compress LL+LIT+ML+OFF backwards, step by step, which makes step-by-step decompression possible (without additional buffers, in a way similar to Huffman coding)

    I think the second option should be faster.
    Is it possible with the current FSE API?
    Last edited by inikep; 17th November 2014 at 20:02.

  30. #175
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    819
    Thanks
    253
    Thanked 264 Times in 163 Posts
    As I understand it, you have four buffers (with different statistical behavior) and you would like to use four separate order-0 byte-wise models?

    If so, and speed is of the essence, option 1 (encoding the buffers separately) is definitely better (and you can use FSE directly): the decoder doesn't need to switch between contexts, and 4 such tables would likely not fit into L1 cache simultaneously.
    The cost is the memory needed for these buffers, but e.g. 4x32kB doesn't sound bad (using larger blocks would probably worsen the compression ratio, as it would be less adaptive to the local situation).
    The cost of storing the final state can be compensated by placing some information in the initial state.

    If you choose option 2, first store a sequence of pairs (symbol, buffer number 0-3), then encode them in backward order ... it isn't supported by the current FSE, but it is a matter of a simple modification: prepare 4 tables and then add an index into them, e.g. decodingTable[x] becomes decodingTable[buffer_number][x].
    Last edited by Jarek; 18th November 2014 at 11:25.
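    To make the "encode backward with per-context tables" idea concrete, here is a toy sketch in Python. It uses a simple range-ANS (rANS) state machine rather than FSE's tabled variant, to keep it short, but the structure is the same: one frequency table per context (here two contexts instead of the four LL/LIT/ML/OFF streams), symbols encoded in backward order and decoded forward, with the context index selecting the table at each step. All names and constants are illustrative, not part of FSE's API:

```python
M_BITS = 12
M = 1 << M_BITS        # total of each frequency table (the 'tableLog' scale)
L = 1 << 16            # lower bound of the rANS state interval [L, 256*L)

def build_table(freqs):
    """Cumulative counts and a slot->symbol map for one context."""
    assert sum(freqs) == M
    cum = [0]
    for f in freqs:
        cum.append(cum[-1] + f)
    slot_to_sym = [s for s, f in enumerate(freqs) for _ in range(f)]
    return freqs, cum, slot_to_sym

def encode(pairs, tables):
    """pairs = (symbol, context) in DECODING order; encoded backward."""
    x, out = L, []
    for sym, ctx in reversed(pairs):
        f, cum, _ = tables[ctx]
        while x >= (f[sym] << M_BITS):           # renormalize: emit a byte
            out.append(x & 0xFF)
            x >>= 8
        x = (x // f[sym]) * M + cum[sym] + x % f[sym]
    return x, out                                # final state + byte stream

def decode(x, data, ctx_seq, tables):
    """Forward decoding; ctx_seq says which table to use at each step
    (in a real LZ decoder, the phase -- LL/LIT/ML/OFF -- implies it)."""
    data, res = list(data), []
    for ctx in ctx_seq:
        f, cum, slot_to_sym = tables[ctx]
        slot = x % M
        s = slot_to_sym[slot]
        x = f[s] * (x >> M_BITS) + slot - cum[s]
        while x < L:                             # renormalize: read a byte
            x = (x << 8) | data.pop()
        res.append(s)
    assert x == L and not data                   # state back to start value
    return res

# Round trip: two contexts with different statistics, mixed in one stream.
tables = [build_table([2048, 1024, 512, 512]),    # context 0
          build_table([1024, 1024, 1024, 1024])]  # context 1
pairs = [(0, 0), (3, 1), (1, 0), (2, 1), (0, 0), (1, 1)]
x, data = encode(pairs, tables)
assert decode(x, data, [c for _, c in pairs], tables) == [s for s, _ in pairs]
```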

  31. #176
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    892
    Thanks
    492
    Thanked 280 Times in 120 Posts
    I think that second option should be faster.
    Is it possible with current FSE API?
    Yes it is. See the section:
    FSE streaming API

    I'm not sure if it will be faster.
    But it will remove the need for intermediate buffers, which is in itself a worthy goal.

  32. #177
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    Thanks Jarek and Cyan. I think I will experiment with both possibilities.

  33. #178
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    I've implemented both possibilities, and the option that decompresses whole buffers is faster, because of optimizations in FSE_decompressStreams_usingDTable_generic().

    The results below use 1 core of an Intel Core i5-4300U @ 2.5 GHz, Windows 8.1 64-bit (MinGW-w64 Win-builds compilation under gcc 4.8.2), and 3 iterations. The input file is ENWIK8.

    Code:
    FSE 2014-10-16 [64 KB]  379 ms (257 MB/s), 63395180, 238 ms (410 MB/s)
    FSE 2014-10-16 [95 MB]  371 ms (263 MB/s), 64132507, 226 ms (432 MB/s)
    lzmax16_byte            2622 ms (37 MB/s), 43413551, 53 ms (1842 MB/s)
    lzmax16_FSE             2351 ms (41 MB/s), 36963903, 204 ms (478 MB/s)
    zlib 1.2.8 -1           1703 ms (57 MB/s), 42298774, 466 ms (209 MB/s)
    zlib 1.2.8 -6           5858 ms (16 MB/s), 36548921, 455 ms (214 MB/s)
    lzmax16_byte is similar to LZ4hc. The introduction of FSE gave a 15% improvement in compression, but decompression speed is 4 times lower.
    The resulting compression ratio, however, is similar to zlib 1.2.8 -6, but with more than two times faster decompression.

  34. Thanks (2):

    Cyan (27th November 2014),Jarek (26th November 2014)

  35. #179
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    819
    Thanks
    253
    Thanked 264 Times in 163 Posts
    zhuff is LZ4+FSE and gets 34907478 on ENWIK8. Could you compare their speeds?

    PS. If you give me a link to an executable, I can add lzmax to the list ( http://encode.su/threads/2078-List-o...mplementations ).

  36. #180
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    ENWIK8, the same laptop:
    Code:
    *** Zhuff 64-bits v0.99beta, by Yann Collet (Aug 11 2014) ***
    Nb of threads = 1 ; Compression Level = 0
    z:\zhuff\enwik8 : 100000000 ->  37052920 (37.05%),  115.7 MB/s ,  467.3 MB/s
    
    Nb of threads = 1 ; Compression Level = 1
    z:\zhuff\enwik8 : 100000000 ->  35837866 (35.84%),   31.4 MB/s ,  474.2 MB/s
    
    Nb of threads = 1 ; Compression Level = 2
    z:\zhuff\enwik8 : 100000000 ->  34876724 (34.88%),   20.7 MB/s ,  496.0 MB/s
