
Thread: FPC - Fast Prefix Coder

  1. #1
    Member (Greece, joined Apr 2015)

    FPC - Fast Prefix Coder

    FPC is a new prefix (Huffman) coder.

    It has adaptive block subdivision (something like optimal parsing, but for blocks) and uses package-merge. Compression ratio is hurt by the simple bit-length encoding (4 bits per length plus RLE).

    More info: https://github.com/algorithm314/FPC
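    FPC's header stores only the per-symbol code lengths (4 bits each, plus RLE) and rebuilds the actual codes canonically on both sides. A minimal sketch of canonical code assignment from lengths, in the usual DEFLATE style (illustrative only, not FPC's actual code):

    ```c
    #include <stdint.h>

    #define MAX_CODE_LEN 15  /* assumed length limit for this sketch */

    /* Assign canonical prefix codes from per-symbol code lengths.
       Shorter lengths get numerically smaller codes; ties are broken
       by symbol order, so only the lengths need to be transmitted. */
    void canonical_codes(const uint8_t *len, int n, uint16_t *code)
    {
        int count[MAX_CODE_LEN + 1] = {0};
        uint16_t next[MAX_CODE_LEN + 1];

        for (int i = 0; i < n; i++)
            count[len[i]]++;
        count[0] = 0;                       /* length 0 = symbol unused */

        uint16_t c = 0;                     /* first code of each length */
        for (int bits = 1; bits <= MAX_CODE_LEN; bits++) {
            c = (uint16_t)((c + count[bits - 1]) << 1);
            next[bits] = c;
        }
        for (int i = 0; i < n; i++)
            if (len[i])
                code[i] = next[len[i]]++;
    }
    ```

    With lengths {2, 1, 3, 3} this yields the codes {10, 0, 110, 111}, a valid prefix set.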

    enwik8

    i7-6500U


    Code:
    16KB block
    100000000 -> 63155808, 63.16% of original,ratio = 1.583
    compression speed 438.36 MB/s, decompression speed 731.63 MB/s
    
    32KB block
    100000000 -> 63415354, 63.42% of original,ratio = 1.577
    compression speed 503.01 MB/s, decompression speed 803.38 MB/s
      
    adaptive
    100000000 -> 62662924, 62.66% of original,ratio = 1.596
    compression speed 49.70 MB/s, decompression speed 698.06 MB/s
      
    huff0
    100000000 ->  63258831 (63.26%),  576.3 MB/s ,  694.0 MB/s 
    
    actual output file size for huff0 (non benchmark mode) 63423952
    for fpc actual file size difference is 4 bytes
    (2 bytes magic number + 2 bytes end of stream)
    Orange Pi PC Plus (Allwinner H3)

    Code:
    16KB block
    100000000 -> 63155808, 63.16% of original,ratio = 1.583
    compression speed 55.50 MB/s, decompression speed 78.11 MB/s
    
    32KB block
    100000000 -> 63415354, 63.42% of original,ratio = 1.577
    compression speed 59.00 MB/s, decompression speed 79.64 MB/s
    
    adaptive
    100000000 -> 62662924, 62.66% of original,ratio = 1.596
    compression speed 4.04 MB/s, decompression speed 72.70 MB/s
    
    huff0
    100000000 ->  63258831 (63.26%),   52.5 MB/s ,   52.6 MB/s
    
    actual output file size for huff0 (non benchmark mode) 63423952
    for fpc actual file size difference is 4 bytes
    (2 bytes magic number + 2 bytes end of stream)


  2. Thanks (4):

    Bulat Ziganshin (17th July 2017),Cyan (18th July 2017),encode (17th July 2017),JamesB (17th July 2017)

  3. #2
    Member (Greece, joined Apr 2015)
    Another benchmark, where only fpaq0p_sh achieves a better compression ratio. (fsehuf does not work in turbobench.)
    file: gcc

    Code:
    FPC adaptive
    832120 -> 518942, 62.36% of original,ratio = 1.603
    compression speed 50.86 MB/s, decompression speed 605.78 MB/s
    
    FPC 16KB
    832120 -> 533527, 64.12% of original,ratio = 1.560
    compression speed 411.60 MB/s, decompression speed 746.54 MB/s
          
          499446    60.0      25.99      33.54   fpaq0p_sh                        gcc
          524349    63.0     344.04    1065.81   TurboANX                         gcc
          529306    63.6     568.83     756.34   TurboHF                          gcc
          532636    64.0      35.63      23.52   subotin                          gcc
          539251    64.8     261.76     568.75   rans_static16                    gcc
          539770    64.9     319.67     488.78   fse                              gcc
          540494    65.0     181.90     220.38   zlibh                            gcc
          553739    66.5     130.48      40.12   FastAC                           gcc
          564843    67.9      91.83      65.65   FastHF                           gcc

  4. #3
    Member (Worldwide, joined Mar 2013)
    Quote Originally Posted by algorithm View Post
    fsehuf does not work in turbobench
    It will work when you increase the workspace size from 1024 (unsigned workSpace[1024]) to 2048 in the file "zstd/lib/compress/huf_compress.c".


  5. #4
    Member (Greece, joined Apr 2015)
    @dnd Thank you for your reply.

    Also, Turbobench prints "S" and stops working if I give it more than 2 codecs. I have the Linux build from your site.

    It would be great if you added FPC to turbobench. You can use the comp_block and dec_block functions. Let me know if you want any help, such as putting main in a separate file.

  6. #5
    Member (Worldwide, joined Mar 2013)
    To avoid CPU throttling, turbobench prints "S" and pauses for ~20 seconds (after every 2 minutes).
    For precise decompression timings, it is recommended to use the option "-J31" (20 sec. pause at runs 8, 16, 24).
    I have already integrated FPC into turbobench by surrounding the main block with "#ifdef TEST ... #endif" and creating a C header file "fpc.h" for comp_block and dec_block.
    It is better to make the changes yourself, so FPC can simply be included as a submodule.

  7. Thanks:

    algorithm (18th July 2017)

  8. #6
    Member (Greece, joined Apr 2015)
    Thank you very much, dnd. I updated FPC; it is now a library, for easier integration.

  9. #7
    Member (Greece, joined Apr 2015)
    New version. Adaptive mode is now 3x faster, at a small cost in compression ratio. Also added big-endian support.
    Code:
    enwik8
    100000000 -> 62786986, 62.79% of original,ratio = 1.593
    compression speed 151.36 MB/s, decompression speed 719.02 MB/s
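    The adaptive block subdivision from the first post ("something like optimal parsing but for blocks") can be sketched as a dynamic program over candidate split points at a fixed step: a segment costs its order-0 entropy plus a flat header charge, and a split is kept only where it pays for the extra header. The step size, header charge, and function names below are assumptions for illustration; FPC's real cost model and search are its own.

    ```c
    #include <math.h>
    #include <stddef.h>
    #include <stdint.h>

    #define STEP        1024   /* boundary granularity (cf. the "step 1024" option) */
    #define HEADER_BITS 1024   /* flat per-block header charge (assumed) */

    /* Order-0 entropy cost, in bits, of coding data[a..b) as one block. */
    static double segment_bits(const uint8_t *data, size_t a, size_t b)
    {
        size_t freq[256] = {0}, n = b - a;
        for (size_t i = a; i < b; i++)
            freq[data[i]]++;
        double bits = 0.0;
        for (int s = 0; s < 256; s++)
            if (freq[s])
                bits += freq[s] * log2((double)n / freq[s]);
        return bits + HEADER_BITS;
    }

    /* cost[k] = cheapest way to code the first k*STEP bytes; O(steps^2)
       for clarity -- a production coder would prune candidates hard. */
    double best_split_bits(const uint8_t *data, size_t len, double *cost)
    {
        size_t steps = len / STEP;          /* len assumed a multiple of STEP */
        cost[0] = 0.0;
        for (size_t k = 1; k <= steps; k++) {
            cost[k] = INFINITY;
            for (size_t j = 0; j < k; j++) {
                double c = cost[j] + segment_bits(data, j * STEP, k * STEP);
                if (c < cost[k])
                    cost[k] = c;
            }
        }
        return cost[steps];
    }
    ```

    On homogeneous data the optimum is a single block (one header); when two halves have very different statistics, paying a second header wins.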

  10. #8
    Member (Greece, joined Apr 2015)
    Updated FPC

    Changes:
    new API
    added safe decoding
    improved compression ratio for binary files in adaptive mode
    faster decoding in arm
    bug fixes

    Changed the dec_block and comp_block functions to FPC_decompress(out, outlen, in, inlen) and FPC_compress(out, in, inlen, blev*1024) in turbobench.

    results
    Code:
       124509362    58.7      24.38      31.32   fpaq0p_sh                        silesia.tar
       128069476    60.4      50.90     748.31   fpc 0  adaptive step 1024        silesia.tar
       128292454    60.5     154.35     762.67   fpc 0  adaptive step 2048        silesia.tar
       129091350    60.9      32.39      19.75   subotin                          silesia.tar
       129652522    61.2     470.33     772.69   fpc 16                           silesia.tar
       129975267    61.3     337.69     444.57   fse                              silesia.tar
       129985922    61.3     213.44     217.51   zlibh                            silesia.tar
       130055166    61.4     260.54     597.54   rans_static16                    silesia.tar
       130166079    61.4     545.33     843.62   fpc 32                           silesia.tar
       130369251    61.5     235.17     208.49   fsc                              silesia.tar
       130733727    61.7     135.96      37.44   FastAC                           silesia.tar
       131945453    62.3      97.45      71.06   FastHF                           silesia.tar
    fsehuf crashes even if I change workSpace[1024] to 2048
    fse -b -h silesia.tar crashes with
    !! Error decompressing block 312 of cSize 13315 !! => (Corrupted block detected)
    but non-benchmark mode produces 130135946 bytes

  11. Thanks (2):

    JamesB (7th August 2017),Stephan Busch (7th August 2017)

  12. #9
    Member (Cambridge, UK, joined Dec 2011)
    Built-in on-the-fly RLE coding can be very quick and saves space in this example (silesia.tar), although it's not really a pure entropy encoder at that point.

    Block size can also greatly affect both speed and ratio. (More blocks means better tuning to the varying nature of the data set, but comes at the price of more frequency tables to encode, or more adaptation time.) It looks like you're using a 32 KB or so block size? That's a reasonable middle ground, I think.

    My rans_static16 code had a very poor encoding for order-1 frequency tables, which means that at a 32 KB block size the order-1 encoding is larger than order-0! I have a "pr" variant that includes a tighter order-1 frequency table encoding and yielded 102531939 bytes for order-1 at a 32 KB block size, or 97150768 bytes for order-1 at 256 KB (95637321 with RLE). Alas, I haven't implemented that code in the 32-way AVX2 variant yet, otherwise it'd fly and be pretty well compressed too.
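    A minimal byte-level RLE of the general kind being described, runs of four identical bytes followed by an extra-repeat count, purely as an illustration (this is not the encoding used in rans_static16 or FPC):

    ```c
    #include <stddef.h>
    #include <stdint.h>

    /* Runs of >= 4 identical bytes become: b b b b <extra>, where <extra>
       (0..255) counts additional repeats.  Anything shorter is emitted
       verbatim, so worst-case expansion is 1 byte per 4 (runs of exactly
       4).  Returns the encoded length; dst needs n + n/4 bytes. */
    size_t rle_encode(uint8_t *dst, const uint8_t *src, size_t n)
    {
        size_t i = 0, o = 0;
        while (i < n) {
            uint8_t b = src[i];
            size_t run = 1;
            while (i + run < n && src[i + run] == b)
                run++;
            if (run >= 4) {
                size_t extra = run - 4;
                if (extra > 255)
                    extra = 255;              /* longer runs restart */
                dst[o++] = b; dst[o++] = b; dst[o++] = b; dst[o++] = b;
                dst[o++] = (uint8_t)extra;
                i += 4 + extra;
            } else {
                while (run--) { dst[o++] = b; i++; }
            }
        }
        return o;
    }

    size_t rle_decode(uint8_t *dst, const uint8_t *src, size_t n)
    {
        size_t i = 0, o = 0, same = 0;
        uint8_t prev = 0;
        while (i < n) {
            uint8_t b = src[i++];
            dst[o++] = b;
            same = (o > 1 && b == prev) ? same + 1 : 1;
            prev = b;
            if (same == 4) {                  /* a count byte must follow */
                uint8_t extra = src[i++];
                while (extra--)
                    dst[o++] = b;
                same = 0;                     /* run detection restarts */
            }
        }
        return o;
    }
    ```

    The decoder tracks "identical bytes since the last count" rather than just inspecting the output, so a count byte is never misread when a new run starts right after a counted one.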
    Last edited by JamesB; 7th August 2017 at 20:47.

  13. #10
    Member (Cambridge, UK, joined Dec 2011)
    I took a look at how it works. Nice job.

    It looks like it's using a fast Huffman approach of lookup tables and bit shifting, but with interleaved streams, which is really where the impressive speed comes from (I tried 1, 2, 3 and 4 and see why you settled on 3). The use of prefetch is a nice idea too (although on my machine it made no difference). I wonder if having more streams with SIMD instructions could work for even greater speed.
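    The interleaved-stream idea can be illustrated with a toy: symbols are dealt round-robin into NUM_STREAMS independent bit streams at encode time, and the decoder advances all streams within one loop iteration, so the serial shift/mask chains of the streams can overlap in the pipeline. Everything below (the fixed 4-bit symbols, the struct names, the buffer sizes) is an assumption for the sketch, not FPC's layout:

    ```c
    #include <stddef.h>
    #include <stdint.h>

    #define NUM_STREAMS 3   /* FPC defaults to 3 streams */

    typedef struct { uint8_t *buf; size_t pos; uint64_t acc; int nbits; } BitW;
    typedef struct { const uint8_t *buf; size_t pos, len; uint64_t acc; int nbits; } BitR;

    static void bw_put(BitW *w, uint32_t v, int n)
    {
        w->acc |= (uint64_t)v << w->nbits;
        w->nbits += n;
        while (w->nbits >= 8) {
            w->buf[w->pos++] = (uint8_t)w->acc;
            w->acc >>= 8;
            w->nbits -= 8;
        }
    }

    static void bw_flush(BitW *w)
    {
        if (w->nbits > 0) {
            w->buf[w->pos++] = (uint8_t)w->acc;
            w->acc = 0;
            w->nbits = 0;
        }
    }

    static uint32_t br_get(BitR *r, int n)
    {
        while (r->nbits < n && r->pos < r->len) {
            r->acc |= (uint64_t)r->buf[r->pos++] << r->nbits;
            r->nbits += 8;
        }
        uint32_t v = (uint32_t)(r->acc & ((1u << n) - 1));
        r->acc >>= n;
        r->nbits -= n;
        return v;
    }

    /* Deal 4-bit symbols round-robin across the streams. */
    void enc3(const uint8_t *syms, size_t n, uint8_t bufs[NUM_STREAMS][256],
              size_t lens[NUM_STREAMS])
    {
        BitW w[NUM_STREAMS];
        for (int s = 0; s < NUM_STREAMS; s++)
            w[s] = (BitW){ bufs[s], 0, 0, 0 };
        for (size_t i = 0; i < n; i++)
            bw_put(&w[i % NUM_STREAMS], syms[i] & 0xF, 4);
        for (int s = 0; s < NUM_STREAMS; s++) {
            bw_flush(&w[s]);
            lens[s] = w[s].pos;
        }
    }

    /* One symbol from each stream per iteration: the three br_get calls
       are independent of each other, which is what buys the speed. */
    void dec3(uint8_t *syms, size_t n, uint8_t bufs[NUM_STREAMS][256],
              const size_t lens[NUM_STREAMS])
    {
        BitR r[NUM_STREAMS];
        for (int s = 0; s < NUM_STREAMS; s++)
            r[s] = (BitR){ bufs[s], 0, lens[s], 0, 0 };
        size_t i = 0;
        for (; i + NUM_STREAMS <= n; i += NUM_STREAMS) {
            syms[i + 0] = (uint8_t)br_get(&r[0], 4);
            syms[i + 1] = (uint8_t)br_get(&r[1], 4);
            syms[i + 2] = (uint8_t)br_get(&r[2], 4);
        }
        for (; i < n; i++)              /* tail symbols */
            syms[i] = (uint8_t)br_get(&r[i % NUM_STREAMS], 4);
    }
    ```

    With real prefix codes, each stream's next-symbol lookup depends on its own previous decode, so interleaving is what keeps the execution units busy.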


    Edit: NUM_STREAMS == 4 seems to be marginally faster (core-i5) if the while loop is changed to only check the final stream rather than all stream end points. Given the serialisation, isn't this a valid condition? E.g.:

    Code:
            while(likely(dest < dest_end REPEAT_ARG(NUM_STREAMS,TEST_STREAM_END))){
    vs

    Code:
            while(likely(dest < dest_end && stream_pos3 <= stream_end)) {
    Additionally, the renormalise loop doesn't need to PREFETCH each and every loop as it fetches entire cache lines which are typically 128 bit (I think) so alternating is sufficient. I added a hack like this, with manual unrolling and reordering of the renorm dec code:

    Code:
                    if (x & 1 == 0) {
                        PREFETCH(stream_pos0+PDIST);
                        PREFETCH(stream_pos1+PDIST);
                    } else {
                        PREFETCH(stream_pos2+PDIST);
                        PREFETCH(stream_pos3+PDIST);
                    }
                    x++;
    
                    bits0 |= LARCH_LE(stream_pos0) << bits_av0;                                
                    bits1 |= LARCH_LE(stream_pos1) << bits_av1;                                
                                                                                               
                    stream_pos0 += (SUB_CONST - bits_av0) >> 3;                                
                    stream_pos1 += (SUB_CONST - bits_av1) >> 3;                                
                                                                                               
                    bits_av0 |= OR_CONST;                                                      
                    bits_av1 |= OR_CONST;                                                      
                                                                                               
                    bits2 |= LARCH_LE(stream_pos2) << bits_av2;                                
                    bits3 |= LARCH_LE(stream_pos3) << bits_av3;                                
                                                                                               
                    stream_pos2 += (SUB_CONST - bits_av2) >> 3;                                
                    stream_pos3 += (SUB_CONST - bits_av3) >> 3;                                
                                                                                               
                    bits_av2 |= OR_CONST;                                                      
                    bits_av3 |= OR_CONST;                                                      
                                                                                               
                    REPEAT(RENORM_NUM,                                                         
                            REPEAT_ARG(NUM_STREAMS,PREFIX_DEC))
    On silesia.tar I get decompression speed 860.77 MB/s (-b16) and 991.82 (-b63), compared to 846.22 MB/s and 966.72 MB/s for the original.
    Last edited by JamesB; 7th August 2017 at 20:46.

  14. Thanks:

    algorithm (7th August 2017)

  15. #11
    Member (Cambridge, UK, joined Dec 2011)
    Gah, well it would be better if I had "(x & 1) == 0" instead of "x & 1 == 0", oops. Although bizarrely that slows it up a bit. Maybe the prefetching isn't actually helping! Anyway, not doing it every time is a win.

  16. #12
    Member (Greece, joined Apr 2015)
    Thank you for looking into the code.

    Prefetching in my tests does not help on Intel CPUs but provides a significant gain on the ARM Cortex-A53: 140 MB/s vs 160 MB/s in my tests; fsehuf is 120 MB/s.
    Maybe unrolling a bit would help; I will test it. I think cache lines are 64 or 32 bytes, depending on architecture.

    I use 3 streams as the default because the speed increase for 4 streams is insignificant on Intel, and more streams => more register pressure => slower when fewer registers are available (x86) or when register renaming is not present, since that effectively means fewer registers. This also relies on the compiler's register allocation. For example, while gcc was faster for the initial versions of FPC, clang is now a bit faster.

    The default block size is 16KB.

    For the safety check, I don't think stream_pos2 <= stream_pos3 always holds: perhaps you can somehow provide long prefix codes for stream 2 and short ones for stream 3 (bit alignment is significant), so it may surpass stream 3. But I am not sure.

    I mostly design FPC for good performance across many architectures and compilers.

  17. #13
    Member (Cambridge, UK, joined Dec 2011)
    Ah, I see what you're saying; good spot. Although the stream_pos##A are all pointers along the same allocated buffer, ending at stream_end (and initialised in DEC_INIT), there is no guarantee given arbitrary or malicious data that stream_pos1 would never overtake stream_pos2; e.g. if we have lots of long codes in stream 1 and short ones in stream 2, but the stream lengths have been incorrectly specified. All the checks therefore are indeed required.

    You've already dealt with the scenario of only checking periodically by decrementing stream_end by the worst-case distance, and then having a smaller more correct loop to handle the remaining cases. I like it.
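    The pattern being praised here, shrinking the bound by the worst case so the hot loop needs no per-access checks and then finishing with a fully checked loop, looks roughly like this sketch (the names and the toy "decode" body are illustrative, not FPC's code):

    ```c
    #include <stddef.h>
    #include <stdint.h>

    #define WORST_CASE 8   /* most input bytes one hot iteration can touch (assumed) */

    /* Toy "decoder" (a plain copy) showing only the loop structure:
       a hot loop with no per-access input check, because the bound was
       pulled in by the worst case, then a fully checked tail loop. */
    size_t decode(uint8_t *dst, size_t dst_len, const uint8_t *src, size_t src_len)
    {
        uint8_t *d = dst, *dend = dst + dst_len;
        const uint8_t *s = src, *send = src + src_len;
        const uint8_t *safe = (src_len >= WORST_CASE) ? send - WORST_CASE : src;

        while (d < dend && s < safe)    /* hot loop: one cheap bound only */
            *d++ = *s++;
        while (d < dend && s < send)    /* tail: exact bounds checks */
            *d++ = *s++;
        return (size_t)(d - dst);
    }
    ```

    The hot loop can safely consume up to WORST_CASE bytes per iteration without checking, and the tail loop handles the last stretch (and truncated input) correctly.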

  18. #14
    Tester Stephan Busch (Bremen, Germany, joined May 2008)
    Quote Originally Posted by algorithm View Post
    Updated FPC [...]
    Could you please provide a windows compile?
    My compile done with "gcc -O3 fpc.c cli.c" does not work.

  19. #15
    Member (Italy, joined Sep 2015)
    Quote Originally Posted by Stephan Busch View Post
    Could you please provide a windows compile?
    My compile done with "gcc -O3 fpc.c cli.c" does not work.
    Makefile says these compiler options: -Wall -O2 -std=c99 -DNDEBUG

  20. #16
    Tester Stephan Busch (Bremen, Germany, joined May 2008)
    Compiling with these options works, but compression does not take place (as with my compile).

  21. #17
    Member (Greece, joined Apr 2015)
    Quote Originally Posted by Stephan Busch View Post
    compiling with these options work, but compression does not take place (as in my compile)
    I will test it tomorrow. What should I install for gcc or clang? MinGW?

  22. #18
    Tester Stephan Busch (Bremen, Germany, joined May 2008)
    I installed MinGW with GCC

  23. #19
    Member (Greece, joined Apr 2015)
    I think the problem was that fopen needs "rb" instead of "r" on Windows. Uploaded binaries to GitHub.
    Also added options for prefetch and unrolling. Unrolling gives a small speedup on 64-bit but is a bit slower on 32-bit.
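    The "rb" point is standard C behaviour: on Windows, text-mode streams translate line endings and treat byte 0x1A as end-of-file, which silently corrupts or truncates compressed data. A tiny illustration of why binary mode matters for a coder's I/O (the helper name is hypothetical, not FPC's cli.c):

    ```c
    #include <stdio.h>

    /* Write raw bytes and read them back.  Without the "b" flag this
       roundtrip can silently fail on Windows (CRLF translation, 0x1A
       treated as end-of-file); with "wb"/"rb" it is portable. */
    int roundtrip(const char *path)
    {
        const unsigned char data[4] = {0x0D, 0x0A, 0x1A, 0x42};
        unsigned char back[4];
        FILE *f = fopen(path, "wb");      /* binary write */
        if (!f) return -1;
        fwrite(data, 1, 4, f);
        fclose(f);
        f = fopen(path, "rb");            /* binary read */
        if (!f) return -1;
        size_t n = fread(back, 1, 4, f);
        fclose(f);
        if (n != 4) return -1;
        for (int i = 0; i < 4; i++)
            if (back[i] != data[i]) return -1;
        return 0;
    }
    ```

    The test payload deliberately contains CR, LF, and 0x1A, exactly the bytes text mode mangles.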

  24. Thanks (2):

    JamesB (9th August 2017),Stephan Busch (9th August 2017)

  25. #20
    Member (Greece, joined Apr 2015)
    A test with skewed distributions: silesia.tar after BWT, and after BWT plus xrle.
    Code:
        47459802    22.4      12.82      10.43   bcmec                            sil.bwt
            54241131    25.6      31.51      43.86   fpaq0p_sh                        sil.bwt
            68592711    32.4      47.50     610.01   fpc 0  adaptive step 1024        sil.bwt
            70826937    33.4     132.10     654.10   fpc 0  adaptive step 2048        sil.bwt
            82717445    39.0     517.83     888.38   fpc 16                           sil.bwt
            83885237    39.6      46.73      28.63   subotin                          sil.bwt
            84931281    40.1     368.39     495.49   fse                              sil.bwt
            84989980    40.1     297.02     607.88   rans_static16                    sil.bwt
            87973514    41.5     586.90     944.93   fpc 32                           sil.bwt
            96758195    45.6     162.70      45.67   FastAC                           sil.bwt
           105699061    49.9     111.47      77.53   FastHF                           sil.bwt
        
            49359005    41.9      11.56       8.95   bcmec                            sil.bwt.rle
            56407148    47.9      25.59      32.82   fpaq0p_sh                        sil.bwt.rle
            63073531    53.6      46.92     502.26   fpc 0  adaptive step 1024        sil.bwt.rle
            63902864    54.3     128.33     530.06   fpc 0  adaptive step 2048        sil.bwt.rle
            68610075    58.3     445.26     801.09   fpc 16                           sil.bwt.rle
            69858950    59.4      34.68      21.07   subotin                          sil.bwt.rle
            70724467    60.1     341.12     596.24   rans_static16                    sil.bwt.rle
            70735616    60.1     334.02     437.76   fse                              sil.bwt.rle
            70927060    60.3     529.93     837.46   fpc 32                           sil.bwt.rle
            74569795    63.4     130.42      39.11   FastAC                           sil.bwt.rle
            76516024    65.0      90.73      66.75   FastHF                           sil.bwt.rle

  26. #21
    Member (Worldwide, joined Mar 2013)
    Quote Originally Posted by algorithm View Post
    fsehuf crashes even if I change workSpace[1024] to 2048
    In my tests it is working:
    under Ubuntu 17.04, 64-bit, with gcc 6.3, and under Windows with mingw64 + gcc 7.1,
    on several different Intel CPUs.

    I've made some changes to call other FSEHUF internal functions.
    It now works without any changes to "huf_compress.c".

  27. Thanks:

    algorithm (12th August 2017)
