Results 1 to 8 of 8

Thread: Ways to ensure that decoder ate all the bytes that encoder produced

  1. #1
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,488
    Thanks
    26
    Thanked 130 Times in 100 Posts

    Ways to ensure that decoder ate all the bytes that encoder produced

    If the title is not meaningful here is what I mean:
    - arithmetic encoding & decoding,
    - I want the arithmetic encoder to produce the same number of bytes that the arithmetic decoder will consume,
    - I want to avoid storing explicit markers or stream lengths,

    Currently my method is to:
    - do renormalization before encoding/ decoding of each symbol instead of after,
    - in encoder, after encoding all the symbols, I flush the current buffer as 4 bytes to stream,

    After a few quick tests, it seems that my method works.

  2. #2
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 795 Times in 488 Posts
    If you encode using (low, high) or (low, range) and flush 4 bytes of low, then isn't it possible to encode more symbols at the low end of the range without causing a normalization or changing the value of high/range?

  3. #3
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,488
    Thanks
    26
    Thanked 130 Times in 100 Posts
    Probably the third statement ("I want to avoid storing explicit markers") was misleading. I code a special flag to denote EOF.

    Java code for encoding EOF (before each symbol the flag is `true`, on EOF it is `false`):
    Code:
        private void encodeSkewed(final boolean flag) throws IOException {
            normalize();
            if (flag) {
                rcRange--;
            } else {
                rcBuffer += rcRange - 1;
                if (rcBuffer > 0xFFFFFFFFl) {
                    carry = true;
                    rcBuffer = rcBuffer & 0xFFFFFFFFl;
                }
                rcRange = 1;
            }
        }
    Now it should be clear.
    Last edited by Piotr Tarsa; 4th September 2012 at 01:21.

  4. #4
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    It almost sounds like your trying to do a bijective arithmetic compressor except sort of use it for streams where there can be several separated files all one after the other. One easy way to do that is to have the first field contain the length of the compressed date in the stream. You do a normal bijective arithmetic compression you get an output data set that contains exactly what the decompress needs. When you decompress you use the first field the length of the compressed stream as where an EOF would occur if doing bijective file decompression then do the standard bijective decompression. No wasted bits.

  5. #5
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,488
    Thanks
    26
    Thanked 130 Times in 100 Posts
    You are generally right, except that I have explicitly stated that I do not want to store stream lengths. I want to support piping, ie reading streams like stdin and writing to streams like stdout and we cannot compute lengths of such stream in general.

  6. #6
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 795 Times in 488 Posts
    The ZPAQ encoder also does not store the uncompressed size. It codes an EOF flag after each byte with an effective probability of 1 - 1/range of false and 2^-32 of true. This requires 9 coding operations per byte, but only 8 of them have any significant modeling so the overhead is not much. The exact algorithm is described in section 4 of http://mattmahoney.net/dc/zpaq200.pdf with PR = 0, where 1 means EOF.

    Also, the encoder throws away a tiny bit of coding space to avoid writing 4 consecutive 0 bytes. It uses this instead to mark the end of the stream so it can be found quickly without decoding.

  7. #7
    Member
    Join Date
    Apr 2010
    Location
    El Salvador
    Posts
    43
    Thanks
    0
    Thanked 1 Time in 1 Post
    I have a few a examples for minimum bits terminations, maybe it helps you, I use it for bit-FIFO (concatenation) of independent arithmetic streams:

    Type 1 (collapsing the range, bytewise, range coder):

     
    forceinline int finishQueue(regval &bitfield) const {
    #if 0
    const int nbits = 18;

    /* output the nbits integer bits (3) */
    bitfield = (ast.low + (ast.range >> 1)) >> (regsze - nbits);
    #else
    int nbits; regval roundup, value;

    /* output the nbits integer bits (1, 2, 3) */
    for (nbits = 1; nbits <= regsze; nbits++) {
    roundup = (1 << (regsze - nbits)) - 1;
    bitfield = (ast.low + roundup) >> (regsze - nbits);
    value = bitfield << (regsze - nbits);

    /* the value-interval (value + whatever bits)
    * [value,value+roundup)
    * falls now always between the interval
    * [low,low+range)
    * ->
    * low <= value <= value+roundup <= low+range
    */
    if (((value + 0) >= (ast.low + 0)) && ((value + roundup) <= (ast.low + (ast.range - 1))))
    break;
    }
    #endif

    assert(nbits <= 18);
    return nbits;
    }

    void injectStop() {
    #define CODING_DISAMBYTS_AR_RANGELOW 3
    /* the formula for calculating the upper bound on the minimum number of
    * bits required is actually something like clz(maxLow - minRange) + 2
    * mostly it means the bytes below botRange can simply be stripped if
    * we use a carry-less coder
    *
    * in the case of delayed underflow-detection this number can be as large
    * as the entire register (because range can approach 1); there is no extra
    * production of of bits though, as normally underflow-bits would build up
    * instead of additional flush-bits
    *
    * in effect it's very important to have a calculatable variable number of
    * termination-bits for the delayed underflow-detection, because otherwise
    * a static known number of bits would require us to always flush all bits
    * in expectance that range could be 1.
    */

    regval bitfield; int nbits = finishQueue(bitfield);
    while (nbits % BITS_OF_UCHAR) bitfield <<= 1, nbits++;
    for (int i = BITS_OF_UCHAR; i <= nbits; i += BITS_OF_UCHAR)
    io->putSample((unsigned char)(bitfield >> (nbits - i)));

    /* Output excess bytes the decoder sucks up. */
    arCoding::injectFlush<regval>(regsze / BITS_OF_UCHAR, (nbits + 7) / BITS_OF_UCHAR, false);
    arCoding::injectStop();
    }

    void dejectStop() {
    regval bitfield; int nbits = finishQueue(bitfield);
    arCoding::dejectCarryOut<regval>(ast.code, regsze / BITS_OF_UCHAR, (nbits + 7) / BITS_OF_UCHAR, false);
    arCoding::dejectStop();
    }



    Type 2 (same as range coder case, just for bits, DC96 coder):


    forceinline int finishQueue(regval &bitfield) const {
    #if 0
    const int nbits = 3;

    /* output the nbits integer bits (3) */
    bitfield = (ast.low + (ast.range >> 1)) >> (regsze - nbits);
    #else
    int nbits; regval roundup, value;

    /* output the nbits integer bits (1, 2, 3) */
    for (nbits = 1; nbits <= regsze; nbits++) {
    roundup = (1 << (regsze - nbits)) - 1;
    bitfield = (ast.low + roundup) >> (regsze - nbits);
    value = bitfield << (regsze - nbits);

    /* the value-interval (value + whatever bits)
    * [value,value+roundup)
    * falls now always between the interval
    * [low,low+range)
    * ->
    * low <= value <= value+roundup <= low+range
    */
    if (((value + 0) >= (ast.low + 0)) && ((value + roundup) <= (ast.low + (ast.range - 1))))
    /* analogue to decodeInterval: */
    // if ((value - ast.low >= 0) && (value - ast.low + roundup <= ast.range - 1))
    break;
    }
    #endif

    assert(nbits <= 3);
    return nbits;
    }

    void injectStop() {
    #define CODING_DISAMBITS_AR_DCC96LOW regsze //3

    /* coders which work with code/value instead of low, manipulate the
    * register while decoding (eg. data-low), this of course destroys
    * trailing data from subsequent decoding; the only way to safeguard
    * the trailing data is to put the entire register out on Stop().!
    *
    * the other possibility is to maintain two registers code and low
    * then the original value can be reconstructed through code+low
    */

    regval bitfield; int nbits = finishQueue(bitfield);
    for (int i = 1; i <= nbits; i++)
    clearUnderflow((bitfield >> (nbits - i)) & 1);

    /* fixed number of bits (with trailing zeros) */
    if (bpolicy && (nbits < CODING_DISAMBITS_AR_DCC96LOW)) {
    bio->BitIOLeft::clearSamples(CODING_DISAMBITS_AR_DCC96LOW - nbits);
    nbits = CODING_DISAMBITS_AR_DCC96LOW;
    }

    /* Output excess bits the decoder sucks up. */
    arCoding::injectFlush<regval>(regsze, nbits);
    arCoding::injectStop();
    }

    void dejectStop() {
    int nbits = CODING_DISAMBITS_AR_DCC96LOW;
    arCoding::dejectCarryOut<regval>(ast.data, regsze, nbits);
    arCoding::dejectStop();
    }



    Type 3 (just add some bits, CACM coder):

      void injectStop() {
    #define CODING_DISAMBITS_AR_CACMLOW 2
    /* Output two bits that select the quarter that
    * the current code range contains.
    */
    ast.underflow += 1;
    clearUnderflow(!(ast.low < firstRange));

    /* Output excess bits the decoder sucks up. */
    arCoding::injectFlush<regval>(regsze, CODING_DISAMBITS_AR_CACMLOW);
    arCoding::injectStop();
    }

    void dejectStop() {
    arCoding::dejectCarryOut<regval>(ast.value, regsze, CODING_DISAMBITS_AR_CACMLOW);
    arCoding::dejectStop();
    }



    Despite not being complete, I hope the functionnames and all are selfexplaining.

    Now the "dejectCarryOver"/"dejectCarryOut"/"injectFlush" are simple function providing the carry abilitity and adding X guard bits etc. pp. Not that important for you I think. If you want to dig into it here's the code.

  8. #8
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    You are generally right, except that I have explicitly stated that I do not want to store stream lengths. I want to support piping, ie reading streams like stdin and writing to streams like stdout and we cannot compute lengths of such stream in general.
    I was actually thinking of streams. The beginning word of the data block is just a length of that block which just happens to be the exact length of the bijectively compressed file. I am not saying to use it as an End Of Stream the stream continues on.
    But yes you have to compute the value after you have compressed the file. Otherwise you have to encode an EOF so you know where in the stream your compressed data ended. If this is not what you want then it seems to be you want something that is really nothing more that the standard ending to compressing to a stream so not exactly a new problem. If you are worried about squeezing a few extra bits on the average. You can encode the number of bits needed as a universal number in front of the bijective compression which has the distance to the last ONE used in the compressed string you don't even need to write this last ONE bit thus saving an extra bit. In which case the stream can be thought of as a bit stream instead of byte stream.

Similar Threads

  1. Compress any file to 4 bytes
    By Matt Mahoney in forum The Off-Topic Lounge
    Replies: 5
    Last Post: 28th June 2011, 07:11
  2. Code generation in LZ decoder / Branchless LZ77 Decoder
    By Shelwien in forum Data Compression
    Replies: 1
    Last Post: 30th September 2010, 20:48
  3. PPMX - a new PPM encoder
    By encode in forum Data Compression
    Replies: 14
    Last Post: 30th November 2008, 16:03
  4. about files to test encoder
    By Krzysiek in forum Data Compression
    Replies: 3
    Last Post: 9th July 2008, 21:22
  5. UNZ - minimal LZW decoder + source
    By encode in forum Forum Archive
    Replies: 7
    Last Post: 29th January 2008, 14:54

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •