
Thread: Optimal ordered bitcodes (or smth like this)

  1. #1
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,474
    Thanks
    26
    Thanked 121 Times in 95 Posts

    Optimal ordered bitcodes (or smth like this)

    I have an idea about such codes, i.e. how to build an ordered bitcode.

    1. Build a Huffman tree.
    2. Assign sumFreq = 2 ^ deepestLeafLevel.
    3. Assign freq[symbol] = sumFreq / 2 ^ symbolLevel.
    4. Encode the stream using a range coder with those probabilities (a sketch of steps 1-3 follows below).
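
    To make steps 1-3 concrete, here is a minimal sketch (my own illustration, not any existing coder); buildPow2Freqs and codeLength are hypothetical names, where codeLength[s] is the Huffman depth of symbol s (0 if unused, at most 31):
    Code:
    #include <algorithm>
    #include <cstdint>

    void buildPow2Freqs(const int codeLength[256], uint32_t freq[256], uint32_t &sumFreq) {
        int deepest = 0;
        for (int s = 0; s < 256; s++)
            deepest = std::max(deepest, codeLength[s]);
        sumFreq = 1u << deepest;                        // step 2: sumFreq = 2^deepestLeafLevel
        for (int s = 0; s < 256; s++)                   // step 3: freq[s] = sumFreq / 2^symbolLevel
            freq[s] = codeLength[s] ? sumFreq >> codeLength[s] : 0;
        // Kraft equality of a Huffman code guarantees the freqs sum exactly to sumFreq.
    }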

    Now the problem is: could it be split and still be decodable (with at most some small context)? I think yes.

    Shelwien tried using normal arithmetic coding in his BWT sorter that uses < 1N of memory, but he failed. Instead he used "optimal ordered bitcodes" of some kind.

  2. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,323
    Thanks
    209
    Thanked 1,001 Times in 526 Posts
    http://encode.su/threads/378-Ordered...es-experiments
    Especially pay attention to Jan Ondrus' post

    I actually implemented 4 such codes, but Jan's dynamic programming method gives the best results.
    Better results are not possible, because it's provably optimal.

  3. #3
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,474
    Thanks
    26
    Thanked 121 Times in 95 Posts
    I probably used the wrong name for that coder. It's almost a usual range coder, except that frequencies are rounded to powers of two. Compression should be identical to Huffman codes (except for a few extra bytes at the end for the range coder) and it should be decodable from any place - provided a small context first, of course. I am not sure about that decodability (starting from any place), unfortunately.

    PS:
    Maybe not decodability but comparability.
    Last edited by Piotr Tarsa; 10th June 2011 at 19:33.

  4. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,323
    Thanks
    209
    Thanked 1,001 Times in 526 Posts
    There's source for such a range coder in that thread, and also a radix BWT based on it in http://encode.su/threads/379-BWT-wit...sed-input-data
    And no, compression is far from identical, otherwise nobody would need Huffman's algorithm to generate optimal prefix codes.

  5. #5
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,474
    Thanks
    26
    Thanked 121 Times in 95 Posts
    I'm not proposing prefix codes, but ordered coding. Usual range coders don't produce prefix codes either.

    If you're talking about rcbwt_v0/not working, then:
    - I don't see any Huffman tree building stage here,
    - the symbol frequencies aren't powers of two,
    - the starting range is also not a power of two.
    Last edited by Piotr Tarsa; 10th June 2011 at 20:08.

  6. #6
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,323
    Thanks
    209
    Thanked 1,001 Times in 526 Posts
    As to Huffman, I just ignored that, because the cumulative sum of Huffman freqs (ari's low) in lexicographic order isn't necessarily a power of 2.
    So afaik your Huffman freq step is meaningless.
    But we can still build a bitcode by padding arithmetic codes to an integer number of bits, although that's less efficient than Huffman/optimal ordered.

    Also, I meant v0/working, which doesn't build anything special but has an alignment step in the encoding functions.

  7. #7
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,474
    Thanks
    26
    Thanked 121 Times in 95 Posts
    Well, ari's low isn't a power of two, of course, but it doesn't need to be. In my proposal:
    - ari's range is always a power of two,
    - the frequency sum is always a power of two,
    - each symbol frequency is always a power of two.
    That means there are no rounding errors, so streams should be decodable/comparable from any place provided a small context (ari's low and ari's range, I think).
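
    Just to illustrate the claim, a tiny self-contained check (the renormalization is simplified to a toy that only scales range/low, and all names and numbers are mine):
    Code:
    #include <cassert>
    #include <cstdint>

    int main() {
        uint32_t range = 1u << 16, low = 0;
        // three symbols with power-of-two freqs summing to a power-of-two total
        const uint32_t freq[3]    = { 1, 1, 2 };
        const uint32_t cumFreq[3] = { 0, 1, 2 };
        const uint32_t totFreq    = 4;
        const uint32_t msg[8]     = { 2, 0, 1, 2, 2, 0, 2, 1 };   // arbitrary symbol stream
        for (uint32_t s : msg) {
            range /= totFreq;              // exact: 2^n / 2^k
            low   += cumFreq[s] * range;   // only low changes non-uniformly
            range *= freq[s];              // still a power of two
            assert(range && (range & (range - 1)) == 0);
            if (range < (1u << 8)) {       // toy renorm (a real coder outputs the top byte of low here)
                range <<= 8;
                low   <<= 8;
            }
        }
        return 0;
    }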

  8. #8
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,323
    Thanks
    209
    Thanked 1,001 Times in 526 Posts
    Range doesn't matter.
    Just think a little: the actual bitcode is ari's low, not freq.

    Code:
     unsigned Process( unsigned freq, unsigned bit ) { 
        unsigned rnew = (qword(range)*(freq<<(32-SCALElog)))>>32;
        if( DECODE ) bit = (code>=rnew);
        bit ? range-=rnew, (DECODE ? code-=rnew : lowc+=rnew) : range=rnew;
        while( range<sTOP ) range<<=8, (DECODE ? (code<<=8)+=getc(f) : ShiftLow());
        return bit;
      }
    See that "range-=rnew"? In your case that would be 2^n-2^k (first time), but how does that help?

  9. #9
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,474
    Thanks
    26
    Thanked 121 Times in 95 Posts
    We should encode the entire symbol at once, not bit by bit.

  10. #10
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,323
    Thanks
    209
    Thanked 1,001 Times in 526 Posts
    It doesn't matter, but ok, here's a bytewise rc:
    Code:
      void rc_Process( uint cumFreq, uint freq, uint totFreq ) {
        range /= totFreq;
        uint tmp = cumFreq*range;
        if( DECODE ) code-=tmp; else lowc+=tmp;
        range *= freq;
        Renorm();
      }

  11. #11
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,474
    Thanks
    26
    Thanked 121 Times in 95 Posts
    So? My statements are true. Range is always a power of two and there are no rounding errors. We could remember the 'low' value for each position in the input buffer (plus the current range), so we'll be able to decode/compare from any position. Am I right?

  12. #12
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,323
    Thanks
    209
    Thanked 1,001 Times in 526 Posts
    Code:
    // stats: a.freq=1<<0, b.freq=1<<1, c.freq=1<<0 -- powers of 2
    range = 16; t = 4;
    rc_Process( a.freq+b.freq, c.freq, t );
    // range /= 4; -- 4
    // range *= 3; -- 12, not 2^n

  13. #13
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,474
    Thanks
    26
    Thanked 121 Times in 95 Posts
    Range * cumFreq is only added to/subtracted from lowc/code; the actual range is always a power of two.

  14. #14
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,323
    Thanks
    209
    Thanked 1,001 Times in 526 Posts
    The actual range coder's range is 12 in the example above, which is not a power of two.
    Which means that rounding _would_ happen at some point, and its result would depend on the initial range value
    at the first symbol of some substring.
    So while the symbol intervals would be the same, the actual arithmetic code can be subtly different in different
    occurrences of the same string.

    Huffman can avoid that by reordering the symbols, but that's what we want to avoid.

  15. #15
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,474
    Thanks
    26
    Thanked 121 Times in 95 Posts
    Code:
    void rc_Process( uint cumFreq, uint freq, uint totFreq ) {
        range /= totFreq;
        uint tmp = cumFreq*range;
        if( DECODE ) code-=tmp; else lowc+=tmp;
        range *= freq;
        Renorm();
      }
    Let's say:
    cumFreq = 3,
    freq = 1
    totFreq = 4
    range = 16
    lowc = 10
    mode = encode

    Let's see what happens:
    Code:
    void rc_Process( uint cumFreq = 3, uint freq = 1, uint totFreq = 4) {
        range /= totFreq;
        // now range == 4
        uint tmp = cumFreq*range;
        // now tmp == 12
        if( DECODE ) code-=tmp; else lowc+=tmp;
        // now lowc = 22
        range *= freq;
        // now range = 4
        Renorm();
        // now situation depends on the contents of renorm() function
      }
    So after encoding that symbol we have:
    range = 4
    lowc = 22

    Range is still a power of two.

    Do you see any error here?

  16. #16
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,323
    Thanks
    209
    Thanked 1,001 Times in 526 Posts
    Ok, so it was my mistake in the end, then your idea should work.

    Although it won't be a bitcode anymore.

    Also I guess we can extend this further... we only need total%freq=0, so
    there are more options than just 2^n for both.
    Btw, the range value can be anything if it's not rounded anywhere.

  17. #17
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,474
    Thanks
    26
    Thanked 121 Times in 95 Posts
    Glad to know

    > Also I guess we can extend this further... we only need total%freq=0, so
    > there are more options than just 2^n for both.
    > Btw, the range value can be anything if it's not rounded anywhere.
    Range must be divisible by totFreq, so it can't be anything.

    2^n is convenient because the Huffman algorithm uses binary trees. Maybe with ternary trees we could use 3^n frequencies, but then how do we renormalize? And I doubt there would be improvements in the general case.
    Last edited by Piotr Tarsa; 10th June 2011 at 21:27.

  18. #18
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,323
    Thanks
    209
    Thanked 1,001 Times in 526 Posts
    > Range must be divisible by totFreq, so it can't be anything.

    "anything if its not rounded anywhere", but it doesn't have to be 2^n, only n*totFreq.
    And renorm can do if( range<totFreq ) range*=totFreq;

    As to binary trees and the Huffman algorithm, it seems that we can go further than that.
    We only need range=LCM(freqs[])*totFreq, so with large enough precision for range and low enough precision for freqs
    it should be possible to use any freqs.

    ...Ok, not so simple, but that's the idea

  19. #19
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,474
    Thanks
    26
    Thanked 121 Times in 95 Posts
    Are you sure?

    Let's assume:
    freq[a] = 2
    freq[b] = 3
    totFreq = 5

    range = totFreq * LCM = 5 * 6 = 30
    low = 10

    Encode 'a':
    Code:
    void rc_Process( uint cumFreq = 0, uint freq = 2, uint totFreq = 5 ) {
        range /= totFreq;
        // now range == 6
        uint tmp = cumFreq*range;
        // now tmp = 0
        if( DECODE ) code-=tmp; else lowc+=tmp;
        // now lowc == 10
        range *= freq;
        // now range = 12
        Renorm();
      }
    Now range > totFreq, but range isn't divisible by totFreq and, more importantly, we can't round it to some multiple of totFreq (as we've precluded rounding).

  20. #20
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,323
    Thanks
    209
    Thanked 1,001 Times in 526 Posts
    Sure, I understand that, that's why there's that remark at the end.
    LCM just seems related, but it's certainly not how it should be used.

    The idea is to do the renorm not when range<limit, but when range%totFreq!=0.
    But I still don't see how to set that up to avoid overflows. Sorry, I'm not very bright today

    A real ordered bitcode still might make more sense though, because it only requires
    storing a context and a bit pointer per position, while methods based on the rc require
    a context, low, range and a byte pointer.

  21. #21
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,474
    Thanks
    26
    Thanked 121 Times in 95 Posts
    > Sorry, I'm not very bright today
    It's OK.

    > A real ordered bitcode still might make more sense though, because it only requires
    > storing a context and a bit pointer per position, while methods based on the rc require
    > a context, low, range and a byte pointer.
    If we take two occurrences of an n-gram from the compressed output, then after subtracting the 'low's at their beginnings, the compressed bitstreams should be identical. So we can first sort & split by the prefixes that are affected by 'low' and then forget about contexts (or whatever) and use only bit pointers. Am I right?
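
    Something like the sketch below is what I have in mind (hypothetical names; only the first 32-bit comparison step is shown, and borrow propagation past those 32 bits is ignored). Per position we store the bit offset into the compressed buffer plus the coder's low, kept as a 32-bit fraction aligned to that offset:
    Code:
    #include <cstdint>

    struct PosState {              // stored per input position
        uint64_t bitPos;           // bit offset into the compressed buffer
        uint32_t low;              // coder's low at this position (32-bit fraction)
    };

    // Assumed helper: read 32 bits MSB-first starting at bit offset bitPos.
    static uint32_t readBits32(const uint8_t *buf, uint64_t bitPos) {
        uint64_t w = 0;
        for (int i = 0; i < 5; i++)                    // gather 40 bits so any bit offset fits
            w = (w << 8) | buf[(bitPos >> 3) + i];
        return (uint32_t)(w >> (8 - (bitPos & 7)));
    }

    // Subtracting the stored lows aligns both streams to the same origin; with
    // power-of-two freqs there is no rounding, so equal source strings give equal
    // aligned values and the comparison is order-preserving.
    int comparePositions(const uint8_t *buf, const PosState &a, const PosState &b) {
        uint32_t va = readBits32(buf, a.bitPos) - a.low;
        uint32_t vb = readBits32(buf, b.bitPos) - b.low;
        return (va < vb) ? -1 : (va > vb) ? 1 : 0;     // on a tie, continue with further bits
    }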

  22. #22
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,323
    Thanks
    209
    Thanked 1,001 Times in 526 Posts
    1. There's a question whether we use bitwise output (range can always stay equal to 2^31) or bytewise.
    In the first case we'd only have to store a bit pointer + low, but encoding would be much slower
    (renorm loops and bitwise output) and we'd have to subtract low at bit offsets.

    And for a bytewise coder we'd have to store log2(range), low, and a byte pointer.

    2. As to "contexts", they're only necessary if we'd use order1+ statistics there.
    In which case symbol intervals would be different in different contexts, so we'd
    be only able to compare compressed strings with same contexts.
    And after sorting we won't need pointers instead.

    Btw, with order0 code we'd have to actually decode data bytes from the compressed
    buffer, which is not very fast.

  23. #23
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,474
    Thanks
    26
    Thanked 121 Times in 95 Posts
    1. We should use whichever is faster. I think that even if we use bitwise renormalization it shouldn't be much slower, as it outputs the same number of bytes, and we can operate on base-2 logarithms of ranges and frequencies (except for cumFreq and lowc, but converting from log2 to the full representation is just one lookup in a tiny LUT), so finding how many bits to shift out is a constant-time operation.

    Update:
    If we operate on logarithms of ranges then we can represent a 2^32 range!
    Update end.
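
    Something along these lines is what I mean (just a sketch with hypothetical names; the renormalization/output part is omitted). With every freq equal to 2^k, the coding step itself runs in the log domain and only low needs the full-width value:
    Code:
    #include <cstdint>

    struct LogCoder {
        uint32_t low;       // full-width low
        int      logRange;  // log2(range); range = 2^32 is representable as logRange = 32
    };

    void encodeSymbol(LogCoder &c, uint32_t cumFreq, int logFreq, int logTotFreq) {
        c.logRange -= logTotFreq;                 // range /= totFreq
        c.low      += cumFreq << c.logRange;      // low += cumFreq * range (a shift, since range = 2^logRange)
        c.logRange += logFreq;                    // range *= freq
        // renorm: shift out the top (32 - c.logRange) bits of low and restore logRange (omitted)
    }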

    But should the type of renormalization affect the final output in this case?

    2. By contexts I meant low and range.

    > Btw, with order0 code we'd have to actually decode data bytes from the compressed
    > buffer, which is not very fast.
    Why? With prefix codes there's no decoding needed, for sure. With my proposal, when we subtract the initial (i.e. occurring at the current position) ari's low, then we should get an ordered string (i.e. identical for any occurrence).

    PS:
    If the above is true (i.e. we don't need to decode anything before comparing) then *maybe* it could be suitable for my QRotSort.

    http://encode.su/threads/379-BWT-wit...ull=1#post7540
    I suppose these are results for o1 prefix coding. I wonder what the compression ratio of o2 huff-range coding (i.e. the variant described in this topic) would be.
    Last edited by Piotr Tarsa; 11th June 2011 at 01:28.

  24. #24
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,323
    Thanks
    209
    Thanked 1,001 Times in 526 Posts
    > 1. We should use whatever that is faster. I think that even if we
    > use bitwise renormalization then it shouldn't be much slower, as it
    > outputs the same number of bytes

    Ok, I actually have some stats (enwik):

    http://nishi.dreamhosters.com/u/fpaq0pv4b4.rar

    Code:
         c.size   c.time  d.time
    rc2d 61287234  2.969s 3.203s // rc with 16-bit output 
    rc2e 61256047  3.547s 3.641s // rc with bytewise output
    rc2f 61256047 11.640s 7.438s // bitwise
    I guess bitwise encoding is especially slow because of these
    cache bit output loops in ShiftLow().

    Anyway, bitwise ari needs several extra branches and loops,
    so it doesn't make any sense.

    > converting from log2 to full representation is
    > just one lookup in a tiny LUT)

    We don't need a LUT for a shift operation

    > so then finding how many bits to shift out is a constant time
    > operation.

    It's not, because the number of bits to shift depends on low,
    and we already agreed that it's not 2^n.
    Also there are some speed optimizations which could normally be applied,
    like carryless coding, but here we can't use an optimization which affects
    string comparisons.

    > If we operate on logarithms of ranges then we can represent a 2^32 range!

    We can represent it even without that, by implying +1.

    > But should the type of renormalization affect the final output in this case?

    Seemingly not, but byte-aligned low subtraction and so on would be much faster.

    >> Btw, with order0 code we'd have to actually decode data bytes from the compressed
    >> buffer, which is not very fast.

    > Why? With prefix codes there's no decoding needed, for sure. With my
    > proposal, when we subtract the initial (ie. occuring at current
    > position) ari's low, then we should get an ordered string (ie.
    > identical for any occurence).

    Don't forget about BWT output.
    I meant the fact that after sorting the pointers, we'd have to produce
    the BWT... and if you suggest keeping and sorting uncompressed bytes
    along with the pointers, then there's no reason not to use order1 compression.

  25. #25
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,474
    Thanks
    26
    Thanked 121 Times in 95 Posts
    > Don't forget about BWT output.
    > I meant the fact that after sorting the pointers, we'd have to produce
    > the BWT... and if you suggest keeping and sorting uncompressed bytes
    > along with the pointers, then there's no reason not to use order1 compression.
    OK. I thought you said I would have to decode while sorting. Actually, I think that in the case of my QRotSort it would be better to store a pointer to the uncompressed data together with a pointer to the compressed data, so that I can use the radix doubling technique as a finishing step. Naive sorting is too time-consuming when the average LCP is high.

    > We don't need a LUT for a shift operation
    Duh, nevermind :P

    > It's not, because the number of bits to shift depends on low,
    > and we already agreed that it's not 2^n.
    It depends on range, not low. In every coder I've seen, the renormalization loop looks like this:
    Code:
    while (range < threshold) {
      renorm();
    }
    The only problem is managing carries, but that only adds a little complexity, and it doesn't require loops.
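
    For reference, one common carry-handling scheme for bytewise output looks roughly like the LZMA-style ShiftLow below (a sketch with my own variable names, not code from any coder in this thread); the only loop is the flush of pending 0xFF bytes:
    Code:
    #include <cstdint>
    #include <cstdio>

    static uint64_t low = 0;        // 64-bit so a carry out of bit 31 shows up in bit 32
    static uint8_t  cache = 0;      // last byte not yet written (a later carry may still bump it)
    static uint32_t cacheSize = 1;  // pending bytes: cache plus a run of 0xFF bytes; first output byte is a dummy 0
    static FILE    *out;            // assumed output file

    static void ShiftLow() {        // used as: while (range < (1u << 24)) { ShiftLow(); range <<= 8; }
        if ((uint32_t)low < 0xFF000000u || (low >> 32) != 0) {
            uint8_t carry = (uint8_t)(low >> 32);          // 0 or 1
            putc((uint8_t)(cache + carry), out);
            while (--cacheSize)                            // pending 0xFF bytes become 0x00 on carry
                putc((uint8_t)(0xFF + carry), out);
            cache = (uint8_t)(low >> 24);
        }
        cacheSize++;
        low = (uint32_t)low << 8;                          // drop the handled top byte, keep the rest
    }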

    You're also showing bitwise coders, while I'm talking about a bytewise coder.
    Last edited by Piotr Tarsa; 11th June 2011 at 02:46.

  26. #26
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,323
    Thanks
    209
    Thanked 1,001 Times in 526 Posts
    > Actually, I think that in case of my QRotSort it would be better to
    > store pointer to uncompressed data together with pointer to
    > compressed data, so that I can use radix doubling technique as
    > finishing step.
    > Naive sorting is too time consuming when average LCP is high.

    For redundant data, compression would also be good.

    And don't forget the rcbwt3 idea about rangecoding the strings from
    each position into fixed-size (32-bit in my case) chunks.
    Like that it's possible to use normal arithmetic coding, without
    any renorms even, and with plain byte pointers and nothing else...
    but it would still be 8N... and while there's a linear way to
    produce the codes (by adding a prefix interval), I couldn't find
    a linear way to count how many symbols are stored in a value
    (to skip them when I see that two 32-bit values match).
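
    Just to illustrate the "adding a prefix interval" part (a hypothetical sketch with my own names and 32-bit fixed-point scaling, not rcbwt3's code): going right to left, the code for position i nests the code for position i+1 inside the interval of symbol buf[i], so all n codes come out in linear time.
    Code:
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // cumFreq[0..256], with cumFreq[256] = total; order-0 stats assumed.
    std::vector<uint32_t> suffixCodes(const uint8_t *buf, size_t n, const uint32_t cumFreq[257]) {
        const uint64_t total = cumFreq[256];
        std::vector<uint32_t> code(n + 1, 0);                      // code[n] = empty suffix
        for (size_t i = n; i-- > 0; ) {
            const uint8_t  s  = buf[i];
            const uint64_t lo = ((uint64_t)cumFreq[s] << 32) / total;                    // interval start
            const uint64_t w  = ((uint64_t)(cumFreq[s + 1] - cumFreq[s]) << 32) / total; // interval width
            code[i] = (uint32_t)(lo + ((w * code[i + 1]) >> 32));  // nest the code of the suffix at i+1
        }
        return code;  // roughly the first 32 bits of each suffix's arithmetic code
    }
    The rounding from that truncation to 32 bits is also exactly why two equal values alone don't tell how many symbols they cover.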

    > You're also showing bitwise coders, while I'm talking about bytewise coder.

    I showed only one bitwise coder (rc2f), and you started it by saying
    "I think that even if we use bitwise renormalization then it
    shouldn't be much slower"

    > I suppose here are results for o1 prefix coding. I wonder what would
    > be the compression ratio of o2 huff-range (ie. variant described in
    > this topic) coding.

    book1 result with o2 ari is about 282-283k, Huffman should be about 300k.
    But o2 is problematic. You'd have to do 256^3 operations to build the code
    (worst case, but still), and the encoding won't fit in L1 (code table
    would take at least 128k), which would make it 3x slower than o1.

    However, we don't have to use an integer number of bytes as the context.
    Nor do we have to use sequential bits.
    For example, a context mask of 0x5F5F could provide better compression
    than 0xFFFF on texts (because it uppercases), while only having 12 bits.
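
    For illustration, a hypothetical helper (assuming the two previous bytes as the context):
    Code:
    #include <cstdint>

    // 0x5F clears bit 5 of each byte, which maps ASCII 'a'..'z' onto 'A'..'Z',
    // so the 16-bit order-2 context collapses to at most 12 significant bits.
    static inline uint32_t maskedContext(uint8_t prev2, uint8_t prev1) {
        return (((uint32_t)prev2 << 8) | prev1) & 0x5F5F;
    }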

    Btw, I'm still thinking about full ari without rounding.
    Let's make it binary for simplicity (only 2 freqs per bit).
    So then we only need f0+f1<=total (not "=") and totals can
    be different for different bits.
    Can't we somehow set it up so that it works out
    (/total, *freq and renorms) without having to use 2^n freqs?

  27. #27
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,474
    Thanks
    26
    Thanked 121 Times in 95 Posts
    >> You're also showing bitwise coders, while I'm talking about a bytewise coder.
    >
    > I showed only one bitwise coder (rc2f), and you started it by saying
    > "I think that even if we use bitwise renormalization then it
    > shouldn't be much slower"
    I had considered (in my mind) a coder that takes a symbol as input, codes it and then does "bitwise" renormalization, i.e. outputs 32 - logRange bits at once (so not necessarily an integral number of bytes) to a temporary buffer and then outputs bytes from that temporary buffer. After encoding, logRange will still be 32.

    > But o2 is problematic. You'd have to do 256^3 operations to build the code
    > (worst case, but still), and the encoding won't fit in L1 (code table
    > would take at least 128k), which would make it 3x slower than o1.
    I'll have to weigh up the pros and cons of o1 and o2.
    Last edited by Piotr Tarsa; 11th June 2011 at 04:26.

  28. #28
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,474
    Thanks
    26
    Thanked 121 Times in 95 Posts
    I have completed my ordered order-1 coder. It has compression equal to Huffman codes, but nevertheless my coder is much slower than Cyan's Huff0. Maybe Cyan encodes/decodes more than one symbol at a time? I don't know.

    On my computer (which is the same as Cyan's, i.e. I have a Core 2 Duo E8400) my coder in order-0 mode runs at about 85 MBps/110 MBps (encoding/decoding), which is low compared to Cyan's roughly 200 MBps/200 MBps. In order-1 mode my coder runs at about 68 MBps/80 MBps.

    Shelwien's (or maybe Ondrus') optimal bitcodes of order-1 compress enwik8 to 53711205, while my coder compressed enwik8 to 49078804, i.e. about 9% smaller.

    Here's the code. Note that I've borrowed the skeleton from my rotation sorting programs and I've commented out/omitted/not yet implemented some parts, so this serves only for testing purposes for now:
    Code:
    /* 
     * File:   main.cpp
     * Author: Piotr Tarsa
     *
     */
    
    #include <cassert>
    #include <cstdio>
    #include <cstdlib>
    #include <ctime>
    #include <cstdint>      // int32_t, int64_t
    #include <sys/types.h>  // u_int8_t, u_int32_t (glibc typedefs used below)
    
    #include <algorithm>
    #include <queue>
    #include <stack>
    #include <utility>
    #include <vector>
    
    //#define USE_CONTEXT_0_NOT_1
    
    struct _index_t {
        int32_t index;
        int32_t info;
    };
    
    typedef struct _index_t index_t;
    
    const int32_t decodingLookupBits = 10;
    
    struct _context_t {
        u_int8_t index[256];
        u_int8_t value[256];
        u_int8_t logRange[256];
        u_int32_t cumulativeRange[256];
        u_int8_t decodedIndex[1 << decodingLookupBits];
        int32_t numSymbols;
    };
    
    typedef struct _context_t context_t;
    
    struct _huffman_node_t {
        int32_t weight;
        int32_t index;
        int32_t leftChild;
        int32_t rightChild;
    
        int operator()(const _huffman_node_t &_this, const _huffman_node_t & _that) const {
            return _this.weight > _that.weight;
        }
    };
    
    typedef struct _huffman_node_t huffman_node_t;
    
    size_t length;
    int32_t *SA;
    u_int8_t *sourceBuffer;
    u_int8_t *outputBuffer;
    
    template<typename T>
    T min(T former, T latter) {
        return former < latter ? former : latter;
    }
    
    template<typename T>
    T max(T former, T latter) {
        return former > latter ? former : latter;
    }
    
    int32_t clampOnce(int32_t value, int32_t limit) {
        return (value >= limit) ? value - limit : value;
    }
    
    /*
     * 
     */
    int main(int argc, char** argv) {
    
        freopen("report.txt", "w", stderr);
    
        //FILE *sourceFile = fopen("/home/piotrek/Dokumenty/BWT/qsufsort.c", "rb");
        FILE *sourceFile = fopen("/home/piotrek/Pobrane/corpora/enwik/enwik8", "rb");
    
        fseek(sourceFile, 0, SEEK_END);
        length = ftell(sourceFile);
        const int32_t DepthLimit = length;
        printf("File length: %ld\n", length);
        fseek(sourceFile, 0, SEEK_SET);
    
        SA = new int32_t[length];
    
        sourceBuffer = new u_int8_t[length];
        outputBuffer = new u_int8_t[length];
    
        size_t readBytes = fread(sourceBuffer, sizeof (u_int8_t), length, sourceFile);
        assert(readBytes == length);
    
        fclose(sourceFile);
    
        clock_t time = clock();
    
        int32_t counts[256][256];
        for (int32_t i = 0; i < 256; i++) {
            for (int32_t j = 0; j < 256; j++) {
                counts[i][j] = 0;
            }
        }
    
        {
            u_int8_t lastChar = sourceBuffer[length - 1];
            for (u_int32_t i = 0; i < length; i++) {
    #ifdef USE_CONTEXT_0_NOT_1
                lastChar = 0;
    #endif
                const u_int8_t currentChar = sourceBuffer[i];
                counts[lastChar][currentChar]++;
                lastChar = currentChar;
            }
        }
    
        context_t contexts[256];
    
        {
            huffman_node_t nodes[512];
            std::priority_queue<huffman_node_t, std::vector<huffman_node_t>, huffman_node_t> heap;
            for (int32_t context = 0; context < 256; context++) {
                bool needRescaling = false;
                do {
                    int32_t numSymbols = 0;
                    for (int32_t i = 0; i < 256; i++) {
                        if (counts[context][i] > 0) {
                            huffman_node_t huffmanNode;
                            huffmanNode.weight = counts[context][i];
                            contexts[context].value[numSymbols] = i;
                            contexts[context].index[i] = numSymbols;
                            huffmanNode.index = numSymbols++;
                            heap.push(huffmanNode);
                        }
                    }
                    contexts[context].numSymbols = numSymbols;
                    if (numSymbols == 0) {
                        continue;
                    }
                    int32_t firstFree = 0;
                    int32_t rootNode = 0;
                    while (true) {
                        huffman_node_t firstNode = heap.top();
                        heap.pop();
                        if (heap.empty()) {
                            rootNode = firstFree;
                            nodes[firstFree++] = firstNode;
                            break;
                        }
                        huffman_node_t secondNode = heap.top();
                        heap.pop();
                        huffman_node_t newNode;
                        newNode.weight = firstNode.weight + secondNode.weight;
                        newNode.index = -1;
                        newNode.leftChild = firstFree;
                        nodes[firstFree++] = firstNode;
                        newNode.rightChild = firstFree;
                        nodes[firstFree++] = secondNode;
                        heap.push(newNode);
                    }
                    int32_t currentLogRange = 32;
                    std::stack<std::pair<int32_t, int32_t> > stack;
                    int32_t currentNode = rootNode;
                    while (true) {
                        if (nodes[currentNode].index == -1) {
                            currentLogRange--;
                            needRescaling |= (currentLogRange < 0);
                            stack.push(std::make_pair(nodes[currentNode].rightChild, currentLogRange));
                            currentNode = nodes[currentNode].leftChild;
                        } else {
                            contexts[context].logRange[nodes[currentNode].index] = currentLogRange;
                            if (stack.empty()) {
                                break;
                            } else {
                                currentNode = stack.top().first;
                                currentLogRange = stack.top().second;
                                stack.pop();
                            }
                        }
                    }
                    u_int32_t currentLow = 0;
                    for (int32_t i = 0; i < numSymbols; i++) {
                        contexts[context].cumulativeRange[i] = currentLow;
                        currentLow += 1 << contexts[context].logRange[i];
                    }
                    u_int8_t decodedIndex = numSymbols - 1;
                    for (int32_t i = (1 << decodingLookupBits) - 1; i >= 0; i--) {
                        u_int32_t currentLow = i << (32 - decodingLookupBits);
                        while (contexts[context].cumulativeRange[decodedIndex] > currentLow) {
                            decodedIndex--;
                        }
                        contexts[context].decodedIndex[i] = decodedIndex;
                    }
                    if (needRescaling) {
                        for (u_int32_t symbol = 0; symbol < 256; symbol++) {
                            counts[context][symbol] += 1;
                            counts[context][symbol] /= 2;
                        }
                        puts("rescaling");
                    }
                } while (needRescaling);
            }
        }
    
        bool useCompressedInput = true;
    
        {
            int64_t totalCost = 0;
            for (int32_t context = 0; context < 256; context++) {
                for (int32_t symbol = 0; symbol < 256; symbol++) {
                    if (counts[context][symbol] > 0) {
                        totalCost += counts[context][symbol] *
                                (32 - contexts[context].logRange[contexts[context].index[symbol]]);
                    }
                }
            }
            double ratio = (double) totalCost / 8. / length;
            printf("Compression ratio: %lf\n", ratio);
            if ((length < 1000) || (ratio > 0.9)) {
                useCompressedInput = false;
            }
        }
    
        printf("Process time: %ld\n", clock() - time);
    
        // compute compressed input
        if (useCompressedInput) {
            time = clock();
            for (u_int32_t i = 0; i < length; i++) {
                outputBuffer[i] = 0;
            }
    
            int32_t outputIndex = 0;
            int32_t numFFdwords = 0;
            int32_t numHeadBits = 0;
            int32_t numTailBits = 0;
            u_int32_t headDword = 0;
            u_int32_t tailDword = 0;
            u_int32_t currentLow = 0;
    
            int32_t lastByte = sourceBuffer[length - 1];
    
            for (u_int32_t i = 0; i <= length; i++) {
    #ifdef USE_CONTEXT_0_NOT_1
                lastByte = 0;
    #endif
                int32_t currentByte = (i == length) ? 0 : sourceBuffer[i];
                int32_t currentIndex = (i == length) ? 0 : contexts[lastByte].index[currentByte];
    
                int32_t currentLogRange = (i == length) ? 0 : contexts[lastByte].logRange[currentIndex];
                int32_t bitsToShift = (i == length) ? 32 : (32 - currentLogRange);
    
                u_int32_t newLow = currentLow + ((i == length) ? *(u_int32_t*) (sourceBuffer) :
                        contexts[lastByte].cumulativeRange[currentIndex]);
    
                if (newLow < currentLow) { // overflow
                    if (numHeadBits == 32) {
                        if (numTailBits > 0) {
                            tailDword += 1 << (32 - numTailBits);
                        }
                        if (tailDword == 0) {
                            *(u_int32_t*) (outputBuffer + outputIndex) = headDword + 1;
                            outputIndex += 4;
                            while (numFFdwords > 0) {
                                numFFdwords--;
                                *(u_int32_t*) (outputBuffer + outputIndex) = 0;
                                outputIndex += 4;
                            }
                            headDword = 0;
                            numHeadBits = numTailBits;
                            numTailBits = 0;
                        } else if (tailDword == 0xffffffff) {
                            numFFdwords++;
                            tailDword = 0;
                            numTailBits = 0;
                        }
                    } else {
                        if (numHeadBits > 0) {
                            headDword += 1 << (32 - numHeadBits);
                        }
                    }
                }
    
                if (numHeadBits == 32) {
                    if (numTailBits + bitsToShift < 32) {
                        if (bitsToShift > 0) {
                            tailDword += (newLow >> currentLogRange) << (currentLogRange - numTailBits);
                            numTailBits += bitsToShift;
                        }
                    } else {
                        int32_t tailBitsToShift = 32 - numTailBits;
    
                        tailDword += newLow >> numTailBits;
                        if (tailDword == 0xffffffff) {
                            numFFdwords++;
                        } else {
                            *(u_int32_t*) (outputBuffer + outputIndex) = headDword;
                            outputIndex += 4;
                            while (numFFdwords > 0) {
                                numFFdwords--;
                                *(u_int32_t*) (outputBuffer + outputIndex) = 0xffffffff;
                                outputIndex += 4;
                            }
                            headDword = tailDword;
                            numHeadBits = 32;
                        }
                        tailDword = (currentLogRange + tailBitsToShift < 32) ?
                                (newLow >> currentLogRange) << (currentLogRange + tailBitsToShift) : 0;
                        numTailBits = bitsToShift - tailBitsToShift;
                    }
                } else {
                    if (numHeadBits + bitsToShift <= 32) {
                        headDword += (newLow >> currentLogRange) << (currentLogRange - numHeadBits);
                        numHeadBits += bitsToShift;
                    } else {
                        int32_t headBitsToShift = 32 - numHeadBits;
    
                        headDword += newLow >> numHeadBits;
                        tailDword = (newLow >> currentLogRange) << (currentLogRange + headBitsToShift);
    
                        numHeadBits = 32;
                        numTailBits = bitsToShift - headBitsToShift;
                    }
                }
    
                currentLow = newLow << bitsToShift;
                lastByte = currentByte;
            }
    
            if (numHeadBits > 0) {
                *(u_int32_t*) (outputBuffer + outputIndex) = headDword;
                outputIndex += 4;
                while (numFFdwords > 0) {
                    numFFdwords--;
                    *(u_int32_t*) (outputBuffer + outputIndex) = 0xffffffff;
                    outputIndex += 4;
                }
                if (numTailBits > 0) {
                    *(u_int32_t*) (outputBuffer + outputIndex) = tailDword;
                    outputIndex += 4;
                }
            }
    
            printf("Output index: %d\n", outputIndex);
            printf("Process time: %ld\n", clock() - time);
        }
    
        if (true) {
            // test print context
            int32_t testedContext = sourceBuffer[length - 1];
            context_t c = contexts[testedContext];
            for (int32_t index = 0; index < c.numSymbols; index++) {
                printf("%3d %3d %2d %8x\n", index, c.value[index], c.logRange[index],
                        c.cumulativeRange[index]);
            }
            puts("");
            for (int32_t symbol = 0; symbol < 256; symbol++) {
                if (counts[testedContext][symbol] > 0) {
                    printf("%3d %3d\n", symbol, c.index[symbol]);
                }
            }
            for (int32_t low = 0; low < 1024; low++) {
                printf("%3x %3d\n", low << 2, c.decodedIndex[low]);
            }
        }
    
        // verify compressed input
        if (useCompressedInput) {
            bool foundBrokenIndex = false;
            time = clock();
            u_int32_t currentLow = *(u_int32_t*) (outputBuffer + 0);
            u_int32_t bufferDword = *(u_int32_t*) (outputBuffer + 4);
            int32_t bufferBits = 32;
            int32_t outputIndex = 8;
    
            int32_t lastByte = sourceBuffer[length - 1];
            for (u_int32_t i = 0; i < length; i++) {
    #ifdef USE_CONTEXT_0_NOT_1
                lastByte = 0;
    #endif
                context_t &currentContext = contexts[lastByte];
                int32_t currentIndex =
                        currentContext.decodedIndex[currentLow >> (32 - decodingLookupBits)];
                while (((currentIndex + 1) < currentContext.numSymbols) &&
                        (currentLow >= currentContext.cumulativeRange[currentIndex + 1])) {
                    currentIndex++;
                }
                lastByte = currentContext.value[currentIndex];
                if ((lastByte != sourceBuffer[i]) && (!foundBrokenIndex)) {
                    foundBrokenIndex = true;
                    printf("First broken index: %d\n", i);
                }
                int32_t bitsToShift = 32 - currentContext.logRange[currentIndex];
                if (bitsToShift != 0) {
                    currentLow -= currentContext.cumulativeRange[currentIndex];
                    currentLow <<= bitsToShift;
                    currentLow += bufferDword >> (32 - bitsToShift);
                    if (bitsToShift < bufferBits) {
                        bufferDword <<= bitsToShift;
                        bufferBits -= bitsToShift;
                    } else if (bitsToShift > bufferBits) {
                        bufferDword = *(u_int32_t*) (outputBuffer + outputIndex);
                        outputIndex += 4;
                        bitsToShift -= bufferBits;
                        currentLow += bufferDword >> (32 - bitsToShift);
                        bufferDword <<= bitsToShift;
                        bufferBits = 32 - bitsToShift;
                    } else {
                        bufferDword = *(u_int32_t*) (outputBuffer + outputIndex);
                        outputIndex += 4;
                        bitsToShift -= bufferBits;
                        bufferBits = 32;
                    }
                }
            }
            printf("Process time: %ld\n", clock() - time);
            length = outputIndex;
        }
    
        fflush(stdout);
    
        return 0;
    
        for (u_int32_t i = 0; i < length; i++) {
            SA[i] = i;
        }
    
    
        for (u_int32_t i = 0; i < length; i++) {
            outputBuffer[i] = sourceBuffer[clampOnce(SA[i] + length - 1, length)];
        }
    
        {
            clock_t time = clock();
            for (u_int32_t i = 1; i < length; i++) {
                int32_t leftPtr = SA[i - 1];
                int32_t middlePtr = SA[i];
                int32_t equal = 0;
                for (; equal < DepthLimit; equal++) {
                    if (sourceBuffer[leftPtr] > sourceBuffer[middlePtr]) {
                        printf("Wrong order on position: %d, equal prefix length: %d\n", i - 1, equal);
                        break;
                    } else if (sourceBuffer[leftPtr] < sourceBuffer[middlePtr]) {
                        break;
                    }
                    leftPtr = ((leftPtr + 1) == length) ? 0 : leftPtr + 1;
                    middlePtr = ((middlePtr + 1) == length) ? 0 : middlePtr + 1;
                }
            }
            printf("Rotation order checking time: %ld\n", clock() - time);
        }
    
        FILE *outputFile = fopen("/home/piotrek/Pulpit/bwtoutput.mergerotsort", "wb");
    
        size_t writtenBytes = fwrite(outputBuffer, sizeof (u_int8_t), length, outputFile);
        assert(writtenBytes == length);
    
        fclose(outputFile);
    
        return 0;
    }
    Last edited by Piotr Tarsa; 26th July 2011 at 00:08.

  29. #29
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    863
    Thanks
    459
    Thanked 257 Times in 105 Posts

    Thumbs up

    Huff0 only outputs one byte at a time (except when some segments are declared uncompressible, or one-byte repetitive, but this is not a common situation).

    The speed comes from the trivial main loop.
    And the main loop is trivial because Huff0 is block based, which means statistics are only updated at block boundaries.
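
    Roughly, the pattern is like the sketch below (just an illustration of the block-based idea, not Huff0's actual code; the code table and bit layout are made up):
    Code:
    #include <cstdint>
    #include <cstdio>

    struct Code { uint32_t bits; uint8_t length; };   // hypothetical per-symbol prefix code (MSB-first)

    // Statistics and the code table are fixed for the whole block, so the main
    // loop is one table lookup and a few shifts per byte, with no model updates.
    void encodeBlock(const uint8_t *src, size_t n, const Code table[256], FILE *out) {
        uint64_t acc = 0;                              // bit accumulator
        int accBits = 0;
        for (size_t i = 0; i < n; i++) {
            const Code c = table[src[i]];
            acc = (acc << c.length) | c.bits;
            accBits += c.length;
            while (accBits >= 8) {                     // flush whole bytes
                accBits -= 8;
                putc((int)((acc >> accBits) & 0xFF), out);
            }
        }
        if (accBits)                                   // final partial byte, zero-padded
            putc((int)((acc << (8 - accBits)) & 0xFF), out);
    }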

    I have not read your code in detail, but it seems to be much more dynamic, hence its superior compression capability.
    Btw, 68 MBps/80 MBps in order-1 mode, I think that is great speed!

  30. #30
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,474
    Thanks
    26
    Thanked 121 Times in 95 Posts
    > I have not read your code in detail, but it seems to be much more dynamic, hence its superior compression capability.
    It is not dynamic. In fact it is completely static, i.e. codes are computed only once for the entire file.
    I have optimized my codec for order-1 coding; maybe with optimization for order-0 I could get significantly closer to your coder's performance. But order-0 is not of particular interest to me, as it offers noticeably lower compression, and maybe that loss of efficiency would make it infeasible for block-sorting.
    My coder has coding efficiency equal to Huffman codes (except maybe for the last renormalizations and flushes, so the output is a few - I think < 8 - bytes longer), and its superiority over Shelwien's approach comes from the fact that ordered bitcodes are inefficient compared to Huffman ones.
    > Btw, 68 MBps/80 MBps in order-1 mode, I think that is great speed!
    Glad to see that. I also think it is good.

