Results 1 to 17 of 17

Thread: Tidiest free-standing AC/RC code

  1. #1
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts

    Tidiest free-standing AC/RC code

    Can the old hands each paste what they believe to be the cleanest, concise and correct arithmetic coders and give their opinion as to its merits, efficiency, performance and constraints? It would be a public service and much appreciated and very directly-useful and instructive to newbies like myself.

  2. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,424
    Thanks
    223
    Thanked 1,053 Times in 565 Posts
    0. "Introduction to rangecoding"
    http://www.ctxmodel.net/rem.pl?-37

    1. Bitwise rc. Afaik this is really the most compact one(s):
    http://encode.su/threads/1153-Simple...angecoder-demo

    2. Bytewise rc.
    http://compression.ru/sh/coders6c2.rar (newest update - 2003)
    http://compression.ru/sh/coders6b.rar
    http://compression.ru/sh/coders6a.rar
    http://compression.ru/sh/aridemo6.rar

    3. Speed-optimized bitwise
    http://nishi.dreamhosters.com/u/fpaq0pv4b4.rar
    http://nishi.dreamhosters.com/u/o01_v0.rar (coroutine)
    http://nishi.dreamhosters.com/u/vecrc_v0.rar (multiple interleaved rcs with vectorizable update)

    4. Usage examples

    Bytewise CM (uses sh_v1m from coders6c2):
    http://encode.su/threads/541-Simple-...xt-mixing-demo
    http://ctxmodel.net/files/green/green_v3a.rar

    PPMd_sh (sh_v1m/sh_v1x):
    http://www.ctxmodel.net/files/PPMd/ppmd_Jr1_sh8.rar

    Bsc( libbsc\qlfc\core\coder.h, ShiftLow from coders6b.rar\src\shindlet.cdr )
    http://libbsc.com/Documents/bsc-2.8.0-x86.zip (last? version without cuda)
    http://nishi.dreamhosters.com/u/bsc240_qlfc_v2.rar (standalone bsc postcoder)

    Lzma( LzmaEnc.c, ShiftLow from coders6a.rar\aridemo\src\shindlet.cdr )
    http://nishi.dreamhosters.com/u/lzma.rar

    5. Some history

    Original rangecoder is http://web.archive.org/web/201004220...r/rngcod13.zip
    The function that does output in encoder is merged with state renormalization and called "enc_normalize".
    I refactored that source first for http://compression.ru/sh/aridemo3.rar
    And ShiftLow was factored out in aridemo6 - that's my naming, so all rcs using it are derivatives.

    Original ppmd is based on another popular implementation - Subbotin's carryless rc,
    ppmonstr uses likely Shkarin's own port of Schindler's rc, its different from mine.

    ccm is also likely based on original Schindler's rc, guessing from 31-bit range.

  3. #3
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    See also fpaq0.c ( http://mattmahoney.net/dc/#fpaq0 ) for the case of encoding 1 bit at a time and writing 1 byte at a time using 32 bit range and no carries. It is also the coder used in most PAQ versions.

    ZPAQ uses a similar encoder but is a little cleaner with regard to EOF. However it has the extra complication that it never encodes 4 zero bytes in a row, allowing it to be used to mark the end of the data. It does this by setting the low end of the range to at least 1. fpaq0 and PAQ require that the uncompressed length be stored separately so that the encoder and decoder know when to stop. ZPAQ does not store the length. Instead it encodes an EOF bit after each byte with probability 0 when EOF is reached. This flushes the encoder state, so there is no need to explicitly flush it. During decoding when the EOF bit is decoded, the decoder reads the next 4 bytes, which should be all zeros, which should never occur otherwise. The encoder and decoder is easiest to understand from the decoder specification in http://mattmahoney.net/dc/zpaq1.pdf
    The encoder and decoder source code can be found in libzpaq from http://mattmahoney.net/dc/zpaq.html

  4. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,424
    Thanks
    223
    Thanked 1,053 Times in 565 Posts
    Matt, please look at fpaq0p_mm.cpp in http://encode.su/threads/1153-Simple...angecoder-demo
    and tell us what it lacks comparing to http://mattmahoney.net/dc/fpaq0.cpp

  5. #5
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Well, just a couple things. fpaq0p_sh has 5 state variables, where fpaq0 and fpaq0p have 3. Also fpaq0p is faster (although not compressing quite as well).

    Code:
    61,457,810 enwik8.fpaq0p, 11.5s, 12.1s
    61,320,547 enwik8.fpaq0p_mm, 19.6s, 21.0s
    61,250,814 enwik8.fpaq0p_sh, 17.2s, 23.5s
    This was after recompiling everything with mingw g++ 4.6.1 -O3 -fomit-frame-pointer -s on a 2.0 GHz T3200, Win32. The original .exe were slower, around 37-40s.

    But that is a nice use of templates to remove duplicate code

  6. #6
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Just to be fair, it's faster than zpaq. But zpaq is doing other stuff like computing and verifying SHA1 checksums.

    Code:
    61,290,591 enwik8-mo08.zpaq, 36.2s, 45.6s
    61,290,689 enwik8-mo08-b50.zpaq, 21.8s, 24.4s
    -b50 splits the input into 2 50MB blocks which are compressed and decompressed in parallel. o08.cfg is equivalent to the model used in fpaq0p, an adaptive order 0 model with a learning rate of 1/32.

    Code:
    comp 0 0 0 0 1
      0 cm 9 8
    hcomp
      halt
    post
      0
    end

  7. #7
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,424
    Thanks
    223
    Thanked 1,053 Times in 565 Posts
    > Well, just a couple things. fpaq0p_sh has 5 state variables,

    With a full rangecoder (not carryless) decoding is simpler and (normally) faster.
    Compare:
    Code:
    // paq-like carryless
    unsigned xmid = x1 + ((x2-x1) >> SCALElog) * P;
    if( x<=xmid ) y=1, x2=xmid; else y=0, x1=xmid+1;
    while( ((x1^x2)&0xFF000000)==0 ) x1<<=8, x2=(x2<<8)+0xFF, x=(x<<8)+getc(f);
    Code:
    // Full rangecoder
    uint rnew = (range>>SCALElog)*freq;
    (bit=(code>=rnew)) ? range-=rnew, code-=rnew : range=rnew;
    while( range<0x01000000 ) range<<=8, (code<<=8)+=getc(f);
    So only 2 state vars here in decoding (code and range),
    and there's an option to replace the renorm loop with 2 ifs (or do 16-bit renorm with 1 if),
    which is impossible for that carryless.

    > where fpaq0 and fpaq0p have 3.

    Yes, encoding is more complicated, but imho its worth that.

    > Also fpaq0p is faster (although not compressing quite as well).

    Actually fpaq0p can't be faster than fpaq0p_mm, because they're
    very similar, but the latter has to (de)code 8 bits per byte vs 9 bits in fpaq0p (+EOF flag).
    fpaq0p_sh has some excuses for being slower (higher precision in range update -
    (qword(range)*(freq<<(32-SCALElog)))>>32 vs (range>>SCALElog)*freq )
    and the proc() template which compilers don't really like.

    So if fpaq0p is actually faster, it just means that something is wrong with
    your compiler (excessive unrolling?) or test environment

    I did some tests too: http://nishi.dreamhosters.com/u/rc_v0_test.rar
    Code:
    *0 = IC 11.1 O3 Qipo PGO
    *1 = IC 11.1 O3 Qipo- PGO
    *4 = IC 11.1 O3 Qipo
    *5 = IC 11.1 O3 Qipo-
    *2 = tdm-gcc 4.5.1 -O3 -fomit-frame-pointer -fno-exceptions -fno-rtti -mtune=pentium2 -fwhole-program -s
    *3 = tdm-gcc 4.6.1 -O3 -fomit-frame-pointer -fno-exceptions -fno-rtti -mtune=pentium2 -fwhole-program -s
    
    fpaq0p0 5.235s 6.156s  fpaq0p_mm0 4.375s 5.593s  fpaq0p_sh0 4.422s  5.688s
    fpaq0p1 5.734s 6.282s  fpaq0p_mm1 4.469s 5.687s  fpaq0p_sh1 4.843s  5.406s  
    fpaq0p4 5.421s 5.579s  fpaq0p_mm4 4.484s 6.047s  fpaq0p_sh4 4.969s  8.843s
    fpaq0p5 5.421s 5.579s  fpaq0p_mm5 3.766s 5.656s  fpaq0p_sh5 3.985s  6.094s
    
    fpaq0p2 6.218s 6.188s  fpaq0p_mm2 3.844s 4.656s  fpaq0p_sh2 7.563s 11.047s
    fpaq0p3 5.734s 6.188s  fpaq0p_mm3 6.859s 7.047s  fpaq0p_sh3 7.860s 10.843s
    And some more after fixing up fpaq0p_sh a little:

    Code:
    //  uint rnew = (qword(range)*(freq<<(32-SCALElog)))>>32;
        uint rnew = (range>>SCALElog)*freq;
    
    61250814 -> 61251140
    fpaq0p_sh6 4.188s 8.328s  /Qipo 
    fpaq0p_sh7 3.797s 5.437s  /Qipo-
    fpaq0p_sh8 4.141s 4.953s  /Qipo  /Qprof-use
    fpaq0p_sh9 3.953s 5.156s  /Qipo- /Qprof-use
    fpaq0p_shA 7.422s 7.906s  gcc 4.5.1
    fpaq0p_shB 7.484s 8.016s  gcc 4.6.1
    > The original .exe were slower, around 37-40s.

    Yes, it was a size-opt build with VC6.

    > But that is a nice use of templates to remove duplicate code

    Thanks, I've been writing my coders like that for years though.

    And anyway, I wasn't aiming for speed here (or I'd not use getc/putc i/o).

    So the question is whether you insist that the original fpaq0 or fpaq0p
    would be better as an rc demo for a novice.

  8. #8
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Quote Originally Posted by willvarfar View Post
    Can the old hands each paste what they believe to be the cleanest, concise and correct arithmetic coders and give their opinion as to its merits, efficiency, performance and constraints? It would be a public service and much appreciated and very directly-useful and instructive to newbies like myself.
    Well http://bijective.dogma.net/arb255.zip May not be cleanest, concisest arithmetic coder. It's very old code The website needs to be updated since its made for DjGPP code that does not even run on my current machine.

    However it is a pure bijective arithmetic coders. The others discussed here are not meaning they waste file space when doing simple compression. Also its a pure correct arithmetic coder in the sense that any permutation of the file your compressing goes to the same size gave or take ONE BYTE. Not sure the others can make that claim. So in might be instructive or useful to "newbies" like yourself. However its much worse in other measures of tidiness. You could clean it up a little and make it faster.

  9. #9
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Some time ago i've written a framework for a fast statistical compressors (see attatchment). I'm planning to make a very basic O0 coder, which includes the M1 optimizer, so one can easily build genetically optimized compressors.

    @Shelwien
    I'd suggest to use the make script included with for building GCC PGO executables of fpaq0p* to improve the compression speed (better compiler options). Just replace main.cpp from the attatched packed with fpaq0<whatever>.cpp.
    Attached Files Attached Files
    M1, CMM and other resources - http://sites.google.com/site/toffer86/ or toffer.tk

  10. #10
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts
    thx all. I was rather hoping for code to be pasted inline rather than links. Links have the bad habit of dying. Take for example from the top of the first reply: http://encode.su/threads/1153-Simple...ll=1#post22872 This contains a nice tidy range-coder. Can this be held aloft as solid, general purpose and borrowable? It could perhaps use inttypes.h to be more obvious what the datatype width of qword is and the range and such could be computed from this or ~0 and other portability bits? What min and max values for the input? (As in, spell it out exceedingly obviously) Can people paste the concise coder they think is the Tidiest free-standing AC/RC code?

  11. #11
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,424
    Thanks
    223
    Thanked 1,053 Times in 565 Posts
    > Can this be held aloft as solid, general purpose and borrowable?

    1. Please translate "held aloft as solid"

    2. "general purpose".
    fpaq0p_sh implements the same coding method as what lzma uses for literals
    (both counters and rc).
    But from my experience, every new task needs a new rc revision.
    For example, Matt tweaked his rc in zpaq to use 00 00 00 00 as
    some kind of signature (rc doesn't ever output it).
    And I have 3-4 versions of i/o interface there.
    Also there're lots of known applicable speed optimizations
    which typically affect compression.
    Still, this rc can encode and decode any file <4G in size
    (it uses a 32-bit filesize).

    3. "borrowable".
    Public domain, yeah.

    > It could perhaps use inttypes.h to be more obvious what the datatype
    > width of qword is and the range and such could be computed from this
    > or ~0 and other portability bits?

    I suppose you can do that and post it?
    Though I really hate inttypes.h for historical and aesthetic reasons.

    Also I don't think that making it compatible with different
    int sizes would do any good, aside from turning it unreadable as zlib.

    > What min and max values for the input?
    > (As in, spell it out exceedingly obviously)

    freq=1..SCALE-1, smaller SCALE makes it faster but less precise.
    means a probability freq/SCALE of bit==0.

    bit = any int, treated as boolean; but normally its a bit, ie 0/1.

    > Can people paste the concise coder they think is the
    > Tidiest free-standing AC/RC code?

    I don't think they can, because normally it looks kind of
    like this - http://ezcodesample.com/ralpha/RunCoder1.txt

    Basically rc is not the simplest thing there is, especially when written
    in mainstream C++ style.

  12. #12
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    OK, here is the encoder and decoder from libzpaq. It is public domain (actually MIT modified to remove the restriction of including the license, so same thing). It requires a bit of explanation. ZPAQ is an archiving format that mixes compressed and uncompressed data, so it needs a way to identify where the compressed data ends. It does this with a marker of 4 zero bytes. To support this, the encoder ensures that 4 zero bytes never appears in the compressed data.

    The method Encoder::compress(c) compresses a byte c in the range 0..255, or -1 at EOF. Decoder::decompress() returns a byte 0..255 or -1 at EOF. compress() encodes 9 bits either a 0 followed by 8 bits of c in MSB to LSB order, or a 1 to indicate EOF. Each bit of c is coded with a prediction, p, for that bit in the range 0..65535 estimating the probability that the bit is a 1 as a 16 bit number. For the EOF bit, p = 0. For the bits of c, the predictions come from a Predictor, which has the following 3 methods:

    p() returns next bit prediction in the range 0..32767 as a 15 bit number. The encoder scales this up to an odd 16 bit number, 2*p+1.

    update(bit) is passed 0 or 1 for the true value of the last predicted bit.

    init() initializes. A predictor is initialized with a ZPAQL object which contains code describing how to model the data.

    The encoder has initial state low = 1, high = 0xFFFFFFFF. To encode a bit with probability p (that it is a 1), the range is split into two parts, low..mid and mid+1..high, with a 1 encoded in the lower half with size proportional to p. The statement:

    Code:
    U32 mid=low+((high-low)>>16)*p+((((high-low)&0xffff)*p)>>16);
    multiplies in two 16 bit parts to avoid 32 bit overflow. It could be simplified using 64 bit arithmetic to

    Code:
    uint32_t mid=low+((uint64_t(high-low)*p)>>16);
    U32 is uint32_t. The statement

    Code:
    low+=(low==0);
    ensures that 4 zero bytes are never output. If you don't care, you could remove it from both the encoder and decoder.

    in and out are abstract data types Reader and Writer. A Reader has virtual method get() which returns a byte or -1 at EOF. Writer has put(c) which outputs a byte.

    As the range (high - low) gets smaller, the leading bytes of 32 bit values high and low are flushed so that they are always different. During compression, these bytes are written. During decompression, these bytes are discarded and a byte is read into the low end of curr. You can imagine that low and high are infinite precision numbers with implicit trailing 0 bits after low and 1 bits after high.

    The EOF bit preceding each byte is always modeled with p = 0, not using the predictor. This has the effect of setting high = low and forcing 4 bytes to be flushed and the new state high = 0xFFFFFFFF and low = one of 1, 0x100, 0x10000, or 0x1000000. During encoding, the application would then write 4 zero bytes. During decoding, these bytes would be read into curr = 0, resulting in curr < low, which would be an error (corrupted input) if it occurred anywhere else. The decoder uses curr = 0 as a signal to initialize the decoder by reading the first 4 bytes of input into curr.

    Code:
    // Encoder compresses using an arithmetic code
    class Encoder {
    public:
      Encoder(ZPAQL& z):
        out(0), low(1), high(0xFFFFFFFF), pr(z) {}
      void init();
      void compress(int c);  // c is 0..255 or EOF
      Writer* out;  // destination
    private:
      U32 low, high; // range
      Predictor pr;  // to get p
      void encode(int y, int p); // encode bit y (0..1) with probability p (0..65535)
    };
    
    // Initialize for start of block
    void Encoder::init() {
      low=1;
      high=0xFFFFFFFF;
      pr.init();
    }
    
    // compress bit y having probability p/64K
    void Encoder::encode(int y, int p) {
      assert(out);
      assert(p>=0 && p<65536);
      assert(y==0 || y==1);
      assert(high>low && low>0);
      U32 mid=low+((high-low)>>16)*p+((((high-low)&0xffff)*p)>>16); // split range
      assert(high>mid && mid>=low);
      if (y) high=mid; else low=mid+1; // pick half
      while ((high^low)<0x1000000) { // write identical leading bytes
        out->put(high>>24);  // same as low>>24
        high=high<<8|255;
        low=low<<8;
        low+=(low==0); // so we don't code 4 0 bytes in a row
      }
    }
    
    // compress byte c (0..255 or -1=EOS)
    void Encoder::compress(int c) {
      assert(out);
      if (c==-1)
        encode(1, 0);
      else {
        assert(c>=0 && c<=255);
        encode(0, 0);
        for (int i=7; i>=0; --i) {
          int p=pr.predict()*2+1;
          assert(p>0 && p<65536);
          int y=c>>i&1;
          encode(y, p);
          pr.update(y);
        }
      }
    }
    
    // Decoder decompresses using an arithmetic code
    class Decoder {
    public:
      Reader* in;  // destination
      Decoder(ZPAQL& z);
      int decompress();  // return a byte or EOF
      void init() {pr.init(); low=1; high=0xFFFFFFFF; curr=0;}
    private:
      U32 low, high; // range
      U32 curr;  // last 4 bytes of archive
      Predictor pr;  // to get p
      int decode(int p); // return decoded bit (0..1) with prob. p (0..65535)
    };
    
    Decoder::Decoder(ZPAQL& z):
      in(0), low(1), high(0xFFFFFFFF), curr(0), pr(z) {}
    
    // Return next bit of decoded input, which has 16 bit probability p of being 1
    int Decoder::decode(int p) {
      assert(p>=0 && p<65536);
      assert(high>low && low>0);
      if (curr<low || curr>high) error("archive corrupted");
      assert(curr>=low && curr<=high);
      U32 mid=low+((high-low)>>16)*p+((((high-low)&0xffff)*p)>>16); // split range
      assert(high>mid && mid>=low);
      int y=curr<=mid;
      if (y) high=mid; else low=mid+1; // pick half
      while ((high^low)<0x1000000) { // shift out identical leading bytes
        high=high<<8|255;
        low=low<<8;
        low+=(low==0);
        int c=in->get();
        if (c<0) error("unexpected end of file");
        curr=curr<<8|c;
      }
      return y;
    }
    
    // Decompress 1 byte or -1 at end of input
    int Decoder::decompress() {
      if (curr==0) {  // segment initialization
        for (int i=0; i<4; ++i)
          curr=curr<<8|in->get();
      }
      if (decode(0)) {
        if (curr!=0) error("decoding end of stream");
        return -1;
      }
      else {
        int c=1;
        while (c<256) {  // get 8 bits
          int p=pr.predict()*2+1;
          c+=c+decode(p);
          pr.update(c&1);
        }
        return c-256;
      }
    }
    Last edited by Matt Mahoney; 14th October 2011 at 00:54. Reason: added () to code for mid

  13. #13
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Also, I should mention why zpaq -mo08 is slower than fpaq0p. I mentioned that zpaq computes the sha1 checksum but I can't use that as an excuse

    Code:
    C:\tmp>timer fsum -sha1 enwik8
    
    Timer 3.01  Copyright (c) 2002-2003 Igor Pavlov  2003-07-10
    
    SlavaSoft Optimizing Checksum Utility - fsum 2.51
    Implemented using SlavaSoft QuickHash Library <www.slavasoft.com>
    Copyright (C) SlavaSoft Inc. 1999-2004. All rights reserved.
    
    ; SlavaSoft Optimizing Checksum Utility - fsum 2.51 <www.slavasoft.com>
    ;
    ; Generated on 10/13/11 at 16:59:47
    ;
    57b8363b814821dc9d47aa4d41f58733519076b2 ?SHA1*enwik8
    
    Kernel Time  =     0.078 = 00:00:00.078 =  12%
    User Time    =     0.514 = 00:00:00.514 =  84%
    Process Time =     0.592 = 00:00:00.592 =  97%
    Global Time  =     0.608 = 00:00:00.608 = 100%
    even though my implementation of SHA1 is 4 times slower for reasons I haven't yet figured out.

    But I think most of the slowdown is in the model. First, fpaq0p updates the prediction to reduce the error by 1/32 by using a right shift. However a ZPAQ CM uses 22-bit predictions and 10-bit counts packed into a table of 32 bit ints. The predictions are updated by reducing the prediction error by 1/count, indexed from a table of reciprocals. This improves compression for higher order models by adapting faster at first, but doesn't help much for order 0. The argument 8 in "CM 9 8" limits the count to 8*4.

    A second problem is that the order 0 context is expanded to 9 bits to improve cache locality. (Thus the 9 in "CM 9 8"). This helps for large tables but not order 0. The contexts in binary are:

    000000001 (cache miss)
    00000001x
    0000001xx
    000001xxx
    1xxxx0001 (cache miss)
    1xxxx001x
    1xxxx01xx
    1xxxx1xxx

    A third problem is that the output is squash(stretch(p)) (using 2 table lookups and a bounds check) rather than just p. The reason for doing it this way is that components that take the output of other predictions as inputs (AVG, MIX2, MIX, SSE, ISSE) need stretched (log(p/(1-p))) predictions. The output of the last component is then squashed (1/(1+exp(-p))) to invert the stretch() function for arithmetic coding. This seems wasteful for a single CM, but optimizing it out would introduce small rounding errors that would break archive compatibility.

  14. #14
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,424
    Thanks
    223
    Thanked 1,053 Times in 565 Posts
    Ok, here I added the missing parts:
    http://nishi.dreamhosters.com/u/o0zp_v0.rar

    Code:
    enwik8 compressed size = 61251146
    
           c.time  d.time
    o0zp1  7.938s  8.375s
    o0zp2  6.828s  8.297s
    o0zp3  8.156s  9.953s
    o0zp4  8.890s 10.110s
    
    *1 = IC 11.1 O3 Qipo PGO
    *2 = IC 11.1 O3 Qipo- PGO
    *3 = tdm-gcc 4.5.1 -O3 -fomit-frame-pointer -fno-exceptions -fno-rtti -mtune=pentium2 -fwhole-program -s
    *4 = tdm-gcc 4.6.1 -O3 -fomit-frame-pointer -fno-exceptions -fno-rtti -mtune=pentium2 -fwhole-program -s
    So this is 188 lines and 4,460 bytes of source, and fpaq0p_mm.cpp is 88 lines and 2,387 bytes and is 2x faster.
    Which is better?

  15. #15
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Yes, you're right. On my slower 2 GHz T3200 with g++ 4.6.1 -O3 -fomit-frame-pointer -s, o0zp enwik8 times are 18.1s, 22.7s. I tried a few optimizations:

    - Removing error checks in the decoder makes no difference.
    - Removing "low+=(low==0);" makes no difference in speed, compresses 1 byte smaller.
    - Replacing with a 64 bit multiply like this is 1.5s slower: U32 mid=low+((high-low)*(unsigned long long)p>>16);
    - But using an intermediate unsigned cast is 0.3s faster than original: U32 mid=low+((high-low)*(unsigned long long)(U32)p>>16);

    Edit: inlining encode() and decode(): 16.0s, 18.3s.
    Also inlining compress() and decompress(): 14.0s, 16.5s.

    Edit: replacing Reader and Writer with FILE*: 14.0, 20.6s ???
    Last edited by Matt Mahoney; 15th October 2011 at 01:04.

  16. #16
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,424
    Thanks
    223
    Thanked 1,053 Times in 565 Posts
    Apparently getc/putc on linux are much slower than on windows (likely not macros like VS ones), so here I made
    a limited replacement for these coders: http://nishi.dreamhosters.com/u/getcputc.inc
    You just have to include it after system headers.

    Code:
    // IC 11.1 O3 Qipo-
    // *2 versions use getcputc.inc
    
    fpaq0p_mm   3.797s 5.687s
    fpaq0p_mm2  3.719s 5.453s
    o0zp        6.828s 8.312s
    o0zp2       6.563s 8.344s
    And as to speed optimizations for o0zp, I'd suggest getting rid of EOF coding first.
    Duplicate context tracking in decompress() and Predictor::update is not very good either.
    Also Reader/Writer pointers are better to be replaced with Encoder/Decoder template parameters.

  17. #17
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Forgot to mention my last tests were in Windows Vista 32 bit.

Similar Threads

  1. I'm looking for the best free implementation of deflate
    By caveman in forum Data Compression
    Replies: 2
    Last Post: 22nd November 2010, 09:27
  2. Code Optimisation
    By Cyan in forum Data Compression
    Replies: 18
    Last Post: 18th January 2010, 01:48
  3. Free software to support RARv3
    By lunaris in forum Data Compression
    Replies: 11
    Last Post: 21st January 2009, 20:13
  4. jZip – a free alternative to WinZip
    By LovePimple in forum Forum Archive
    Replies: 10
    Last Post: 31st August 2007, 14:05
  5. QUAD - LIVE FREE OR DIE!!!
    By encode in forum Forum Archive
    Replies: 7
    Last Post: 24th February 2007, 22:58

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •