# Thread: Optimal ordered bitcodes (or smth like this)

1. I have an idea about such codes, ie how to build an ordered bitcode:

1. Build a Huffman tree.
2. Assign sumFreq = 2 ^ deepestLeafLevel.
3. Assign freq[symbol] = sumFreq / 2 ^ symbolLevel.
4. Encode the stream using a range coder with those frequencies.

Now the problem is: could the output be split and still be decodable (with at most a small context)? I think yes.

2. Shelwien tried using normal arithmetic coding in his BWT sorter that uses < 1N of memory, but he failed. Instead he used "optimal ordered bitcodes" of some kind.

Especially pay attention to Jan Ondrus' post.

I actually implemented 4 such codes, but Jan's dynamic programming method gives the best results. Better results are not possible, because it is provably optimal.

3. Probably I've used the wrong name for that coder. It's almost a usual range coder, except that frequencies are rounded to powers of two. Compression should be identical to Huffman codes (except a few extra bytes at the end for the range coder) and it should be decodable from any position - given a small context first, of course. I am not sure about that decodability (starting from any position), unfortunately.

PS:
Maybe not decodability but comparability.

4. There's source for such a rangecoder in that thread, and also a radix BWT based on it in http://encode.su/threads/379-BWT-wit...sed-input-data
And no, compression is far from identical, or nobody would have to use Huffman's algo to generate optimal prefix codes.

5. I'm not proposing prefix codes, but ordered coding. Usual range coders don't produce prefix codes either.

If you're talking about rcbwt_v0/not working, then:
- I don't see any Huffman tree building stage there,
- symbol frequencies aren't powers of two,
- the starting range is also not a power of two.

6. As to Huffman, I just ignored that, because the cumulative sum of Huffman freqs (ari's low) in lexicographic order isn't necessarily a power of 2.
So afaik your Huffman freq step is meaningless.
But we can still build a bitcode by padding arithmetic codes to an integer number of bits, although that's less efficient than Huffman/optimal ordered codes.

Also I meant v0/working which doesn't build anything special, but has an alignment step in encoding functions.

7. Well, ari's low isn't a power of two, of course, but it doesn't need to be. In my proposal:
- ari's range is always a power of two,
- the frequency sum is always a power of two,
- each symbol frequency is always a power of two.
That means there are no rounding errors, so streams should be decodable/comparable from any place given a small context (ari's low and ari's range, I think).

8. Range doesn't matter.
Just think a little: the actual bitcode is ari's low, not freq.

Code:
```
unsigned Process( unsigned freq, unsigned bit ) {
  unsigned rnew = (qword(range)*(freq<<(32-SCALElog)))>>32;
  if( DECODE ) bit = (code>=rnew);
  bit ? range-=rnew, (DECODE ? code-=rnew : lowc+=rnew) : range=rnew;
  while( range<sTOP ) range<<=8, (DECODE ? (code<<=8)+=getc(f) : ShiftLow());
  return bit;
}
```
See that "range-=rnew"? In your case that would be 2^n-2^k (first time), but how does that help?

9. We should encode entire symbol at once, not bit-by-bit.

10. It doesn't matter, but ok, here's a bytewise rc:
Code:
```
void rc_Process( uint cumFreq, uint freq, uint totFreq ) {
  range /= totFreq;
  uint tmp = cumFreq*range;
  if( DECODE ) code-=tmp; else lowc+=tmp;
  range *= freq;
  Renorm();
}
```

11. So? My statements are true. Range is always power of two and there are no rounding errors. We could remember the 'low' value for each position in input buffer (plus the current range) so we'll be able to decode/ compare from any position. Am I right?

12. Code:
```
// stats: a.freq=1<<0, b.freq=1<<1, c.freq=1<<0 -- powers of 2
range = 16; t = 4;
rc_Process( a.freq+b.freq, c.freq, t );
// range /= 4;  -- 4
// range *= 3;  -- 12, not 2^n
```

13. range * cumFreq is only added to lowc / subtracted from code; the actual range is always a power of two.

14. Actual rangecoder's range is 12 in the example above, which is not a power of two.
Which means that rounding _would_ happen at some point, and its result would depend on initial range value
at the first symbol of some substring.
So while symbol intervals would be the same, actual arithmetic code can be subtly different in different
occurrences of the same string.

Huffman can avoid that by reordering the symbols, but that's what we want to avoid.

15. Code:
```
void rc_Process( uint cumFreq, uint freq, uint totFreq ) {
  range /= totFreq;
  uint tmp = cumFreq*range;
  if( DECODE ) code-=tmp; else lowc+=tmp;
  range *= freq;
  Renorm();
}
```
Let's say:
cumFreq = 3,
freq = 1
totFreq = 4
range = 16
lowc = 10
mode = encode

Let's see what happens:
Code:
```
void rc_Process( uint cumFreq = 3, uint freq = 1, uint totFreq = 4 ) {
  range /= totFreq;
  // now range == 4
  uint tmp = cumFreq*range;
  // now tmp == 12
  if( DECODE ) code-=tmp; else lowc+=tmp;
  // now lowc == 22
  range *= freq;
  // now range == 4
  Renorm();
  // now the situation depends on the contents of the Renorm() function
}
```
So after encoding that symbol we have:
range = 4
lowc = 22

Range is still a power of two.

Do you see any error here?

16. Ok, so it was my mistake in the end; then your idea should work.

Although it won't be a bitcode anymore.

Also I guess we can extend this further... we only need total%freq=0, so
there're more options than just 2^n for both.
Btw the range value can be anything if it's not rounded anywhere.

17. > Also I guess we can extend this further... we only need total%freq=0 , so
> there're more options than just 2^n for both.
> Btw range value can be anything if its not rounded anywhere.

Range must be divisible by totFreq, so it can't be anything.

2^n is convenient because the Huffman algo uses binary trees. Maybe with ternary trees we could use 3^n frequencies, but then: how to renormalize? And I doubt there would be improvements in the general case.

18. > Range must be divisible by totFreq, so it can't be anything.

"anything if its not rounded anywhere", but it doesn't have to be 2^n, only n*totFreq.
And renorm can do if( range<totFreq ) range*=totFreq;

As to binary trees and huffman algo, it seems that we can go further than that.
We only need range=LCM(freqs[])*totFreq, so with large enough precision for range and low enough precision for freqs
it should be possible to use any freqs.

...Ok, not so simple, but that's the idea

19. Are you sure?

Let's assume:
freq[a] = 2
freq[b] = 3
totFreq = 5

range = totFreq * LCM = 5 * 6 = 30
low = 10

Encode 'a':
Code:
```
void rc_Process( uint cumFreq = 0, uint freq = 2, uint totFreq = 5 ) {
  range /= totFreq;
  // now range == 6
  uint tmp = cumFreq*range;
  // now tmp == 0
  if( DECODE ) code-=tmp; else lowc+=tmp;
  // now lowc == 10
  range *= freq;
  // now range == 12
  Renorm();
}
```
Now range > totFreq, but range isn't divisible by totFreq and, more importantly, we can't round it to some multiple of totFreq (as we precluded rounding).

20. Sure, I understand that; that's why there's that remark at the end.
LCM just seems related, but it's certainly not how it should be used.

The idea is to do renorm not when range<limit, but when range%totFreq!=0.
But I still don't see how to set that up to avoid overflows. Sorry, I'm not very bright today

Real ordered bitcode still might make more sense though, because it only requires
storing context and bit pointer per position, while methods based on rc require
context, low,range and byte pointer.

21. > Sorry, I'm not very bright today
It's OK.

> Real ordered bitcode still might make more sense though, because it only requires
> storing context and bit pointer per position, while methods based on rc require
> context, low,range and byte pointer.
If we take two occurrences of an n-gram from the compressed output, then after subtracting the 'low's from their beginnings, the compressed bitstreams should be identical. So we can first sort & split by the prefixes that are affected by 'low', and then forget about contexts (or whatever) and use only bit pointers. Am I right?

22. 1. There's a question whether we use bitwise output (range can always stay equal to 2^31) or bytewise.
In the first case we'd only have to store a bit pointer + low, but encoding would be much slower
(renorm loops and bitwise output) and we'd have to subtract low at bit offsets.

And for a bytewise coder we'd have to store log2(range), low, and a byte pointer.

2. As to "contexts", they're only necessary if we use order-1+ statistics there.
In that case symbol intervals would be different in different contexts, so we'd
only be able to compare compressed strings with the same contexts.
And after sorting we won't need the pointers anymore.

Btw, with order0 code we'd have to actually decode data bytes from the compressed
buffer, which is not very fast.

23. 1. We should use whichever is faster. I think that even if we use bitwise renormalization it shouldn't be much slower: it outputs the same number of bytes, and we can operate on base-2 logarithms of ranges and frequencies (except for cumFreq and lowc, but converting from log2 to the full representation is just one lookup in a tiny LUT), so finding how many bits to shift out becomes a constant-time operation.

Update:
If we operate on logarithms of ranges then we can represent a 2^32 range!
Update end.

But should the type of renormalization affect the final output in this case?

2. By contexts I meant low and range.

> Btw, with order0 code we'd have to actually decode data bytes from the compressed
> buffer, which is not very fast.
Why? With prefix codes there's no decoding needed, for sure. With my proposal, when we subtract the initial (ie. occurring at the current position) ari's low, we should get an ordered string (ie. identical for any occurrence).

PS:
If the above is true (ie. we don't need to decode anything before comparing) then *maybe* it could be suitable for my QRotSort.

I suppose these are results for o1 prefix coding. I wonder what the compression ratio of o2 huff-range (ie. the variant described in this topic) coding would be.

24. > 1. We should use whatever that is faster. I think that even if we
> use bitwise renormalization then it shouldn't be much slower, as it
> outputs the same number of bytes

Ok, I actually have some stats (enwik8):

http://nishi.dreamhosters.com/u/fpaq0pv4b4.rar

Code:
```     c.size   c.time  d.time
rc2d 61287234  2.969s 3.203s // rc with 16-bit output
rc2e 61256047  3.547s 3.641s // rc with bytewise output
rc2f 61256047 11.640s 7.438s // bitwise```
I guess bitwise encoding is especially slow because of the cache/bit output loops in ShiftLow().

Anyway, bitwise ari needs several extra branches and loops,
so it doesn't make any sense.

> converting from log2 to full representation is
> just one lookup in a tiny LUT)

We don't need a LUT for a shift operation

> so then finding how many bits to shift out is a constant time
> operation.

It's not, because the number of bits to shift depends on low,
and we already agreed that it's not 2^n.
Also there are some speed optimizations which could normally be applied,
like carryless coding, but here we can't use an optimization which affects
string comparisons.

> If we operate on logarithms of ranges then we can represent a 2^32 range!

We can represent it even without that, by implying +1.

> But should the type of renormalization affect the final output in this case?

Seemingly not, but byte-aligned low subtraction and so on would be much faster.

>> Btw, with order0 code we'd have to actually decode data bytes from the compressed
>> buffer, which is not very fast.

> Why? With prefix codes there's no decoding needed, for sure. With my
> proposal, when we subtract the initial (ie. occuring at current
> position) ari's low, then we should get an ordered string (ie.
> identical for any occurence).

I meant the fact that after sorting the pointers, we'd have to produce
the BWT... and if you suggest to keep and sort uncompressed bytes
along with the pointers, then there's no sense not to use order1 compression.

25. > Don't forget about BWT output.
> I meant the fact that after sorting the pointers, we'd have to produce
> the BWT... and if you suggest to keep and sort uncompressed bytes
> along with the pointers, then there's no sense not to use order1 compression.
OK. I thought you said I had to decode while sorting. Actually, I think that in the case of my QRotSort it would be better to store a pointer to the uncompressed data together with the pointer to the compressed data, so that I can use the radix-doubling technique as a finishing step. Naive sorting is too time-consuming when the average LCP is high.

> We don't need a LUT for a shift operation
Duh, nevermind :P

> Its not, because number of bits to shift depends on low,
> and we already agreed that its not 2^n.
It depends on range, not low. In every coder I've seen, the renormalization loop looks like this:
Code:
```
while (range < threshold) {
  renorm();
}
```
The only problem is managing carries, but that only adds a little complexity, and it doesn't require loops.

You're also showing bitwise coders, while I'm talking about bytewise coder.

26. > Actually, I think that in case of my QRotSort it would be better to
> store pointer to uncompressed data together with pointer to
> compressed data, so that I can use radix doubling technique as
> finishing step.
> Naive sorting is too time consuming when average LCP is high.

For redundant data, compression would also be good.

And don't forget the rcbwt3 idea about rangecoding strings from
each position into fixed-size (32-bit in my case) chunks.
That way it's possible to use normal arithmetic coding, without
any renorms even, and with plain byte pointers and nothing else...
but it would still be 8N... and while there's a linear way to
produce the codes (by adding a prefix interval), I couldn't find
a linear way to count how many symbols are stored in a value
(to skip them when I see that two 32-bit values match).

> You're also showing bitwise coders, while I'm talking about bytewise coder.

I showed only one bitwise coder (rc2f), and you started it by saying
"I think that even if we use bitwise renormalization then it
shouldn't be much slower"

> I suppose here are results for o1 prefix coding. I wonder what would
> be the compression ratio of o2 huff-range (ie. variant described in
> this topic) coding.

book1 result with o2 ari is about 282-283k, huffman should be about 300k.
But o2 is problematic. You'd have to do 256^3 operations to build the code
(worst case, but still), and the encoding won't fit in L1 (code table
would take at least 128k), which would make it 3x slower than o1.

However, we don't have to use an integer number of bytes as context.
Nor do we have to use sequential bits.
For example, a context mask of 0x5F5F could provide better compression
than 0xFFFF on texts (because it upcases), while only having 12 bits.

Btw, I'm still thinking about full ari without rounding.
Let's make it binary for simplicity (only 2 freqs per bit).
Then we only need f0+f1<=total (not "="), and totals can
be different for different bits.
Can't we somehow set it up so that it would work out
(/total, *freq and renorms) without having to use 2^n freqs?

27. >> You're also showing bitwise coders, while I'm talking about bytewise coder.

> I showed only one bitwise coder (rc2f), and you started it by saying
> "I think that even if we use bitwise renormalization then it
> shouldn't be much slower"

I had considered (in my mind) a coder that takes a symbol as input, codes it, and then does "bitwise" renormalization, ie outputs 32 - logRange bits at once (so not necessarily an integral number of bytes) to a temporary buffer, and then outputs bytes from that temporary buffer. After encoding a symbol, logRange is back to 32.

> But o2 is problematic. You'd have to do 256^3 operations to build the code
> (worst case, but still), and the encoding won't fit in L1 (code table
> would take at least 128k), which would make it 3x slower than o1.

I'll have to weigh up the pros and cons of o1 and o2.

28. I have completed my ordered order-1 coder. It has compression equal to Huffman codes, but nevertheless my coder is much slower than Cyan's Huff0. Maybe Cyan encodes/decodes more than one symbol at a time? I don't know.

On my computer (which is equal to Cyan's, ie I have a Core 2 Duo E8400) my coder in order-0 mode has a speed of about 85 MBps/ 110 MBps (encoding/ decoding), which is low compared to Cyan's roughly 200 MBps/ 200 MBps. In order-1 mode my coder has a speed of about 68 MBps/ 80 MBps.

Shelwien's (or maybe Ondrus') optimal ordered bitcodes of order-1 compress enwik8 to 53711205 bytes, while my coder compresses enwik8 to 49078804, ie about 9% smaller.

Here's the code. Note that I've borrowed the skeleton from my rotation-sorting programs, and I've commented out/omitted/not yet implemented some parts, so for now this serves only for testing purposes:
Code:
```/*
* File:   main.cpp
* Author: Piotr Tarsa
*
*/

#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cstdint>
#include <ctime>

#include <sys/types.h>

#include <algorithm>
#include <queue>
#include <stack>
#include <utility>
#include <vector>

//#define USE_CONTEXT_0_NOT_1

struct _index_t {
int32_t index;
int32_t info;
};

typedef struct _index_t index_t;

const int32_t decodingLookupBits = 10;

struct _context_t {
u_int8_t index[256];
u_int8_t value[256];
u_int8_t logRange[256];
u_int32_t cumulativeRange[256];
u_int8_t decodedIndex[1 << decodingLookupBits];
int32_t numSymbols;
};

typedef struct _context_t context_t;

struct _huffman_node_t {
int32_t weight;
int32_t index;
int32_t leftChild;
int32_t rightChild;

int operator()(const _huffman_node_t &_this, const _huffman_node_t & _that) const {
return _this.weight > _that.weight;
}
};

typedef struct _huffman_node_t huffman_node_t;

size_t length;
int32_t *SA;
u_int8_t *sourceBuffer;
u_int8_t *outputBuffer;

template<typename T>
T min(T former, T latter) {
return former < latter ? former : latter;
}

template<typename T>
T max(T former, T latter) {
return former > latter ? former : latter;
}

int32_t clampOnce(int32_t value, int32_t limit) {
return (value >= limit) ? value - limit : value;
}

/*
*
*/
int main(int argc, char** argv) {

freopen("report.txt", "w", stderr);

//FILE *sourceFile = fopen("/home/piotrek/Dokumenty/BWT/qsufsort.c", "rb");
FILE *sourceFile = fopen("/home/piotrek/Pobrane/corpora/enwik/enwik8", "rb");

fseek(sourceFile, 0, SEEK_END);
length = ftell(sourceFile);
const int32_t DepthLimit = length;
printf("File length: %ld\n", length);
fseek(sourceFile, 0, SEEK_SET);

SA = new int32_t[length];

sourceBuffer = new u_int8_t[length];
outputBuffer = new u_int8_t[length];

// read the whole file into sourceBuffer
fread(sourceBuffer, sizeof (u_int8_t), length, sourceFile);
fclose(sourceFile);

clock_t time = clock();

int32_t counts[256][256];
for (int32_t i = 0; i < 256; i++) {
for (int32_t j = 0; j < 256; j++) {
counts[i][j] = 0;
}
}

{
u_int8_t lastChar = sourceBuffer[length - 1];
for (u_int32_t i = 0; i < length; i++) {
#ifdef USE_CONTEXT_0_NOT_1
lastChar = 0;
#endif
const u_int8_t currentChar = sourceBuffer[i];
counts[lastChar][currentChar]++;
lastChar = currentChar;
}
}

context_t contexts[256];

{
huffman_node_t nodes[512];
std::priority_queue<huffman_node_t, std::vector<huffman_node_t>, huffman_node_t> heap;
for (int32_t context = 0; context < 256; context++) {
bool needRescaling = false;
do {
int32_t numSymbols = 0;
for (int32_t i = 0; i < 256; i++) {
if (counts[context][i] > 0) {
huffman_node_t huffmanNode;
huffmanNode.weight = counts[context][i];
contexts[context].value[numSymbols] = i;
contexts[context].index[i] = numSymbols;
huffmanNode.index = numSymbols++;
heap.push(huffmanNode);
}
}
contexts[context].numSymbols = numSymbols;
if (numSymbols == 0) {
continue;
}
int32_t firstFree = 0;
int32_t rootNode = 0;
while (true) {
huffman_node_t firstNode = heap.top();
heap.pop();
if (heap.empty()) {
rootNode = firstFree;
nodes[firstFree++] = firstNode;
break;
}
huffman_node_t secondNode = heap.top();
heap.pop();
huffman_node_t newNode;
newNode.weight = firstNode.weight + secondNode.weight;
newNode.index = -1;
newNode.leftChild = firstFree;
nodes[firstFree++] = firstNode;
newNode.rightChild = firstFree;
nodes[firstFree++] = secondNode;
heap.push(newNode);
}
int32_t currentLogRange = 32;
std::stack<std::pair<int32_t, int32_t> > stack;
int32_t currentNode = rootNode;
while (true) {
if (nodes[currentNode].index == -1) {
currentLogRange--;
needRescaling |= (currentLogRange < 0);
stack.push(std::make_pair(nodes[currentNode].rightChild, currentLogRange));
currentNode = nodes[currentNode].leftChild;
} else {
contexts[context].logRange[nodes[currentNode].index] = currentLogRange;
if (stack.empty()) {
break;
} else {
currentNode = stack.top().first;
currentLogRange = stack.top().second;
stack.pop();
}
}
}
u_int32_t currentLow = 0;
for (int32_t i = 0; i < numSymbols; i++) {
contexts[context].cumulativeRange[i] = currentLow;
currentLow += 1 << contexts[context].logRange[i];
}
u_int8_t decodedIndex = numSymbols - 1;
for (int32_t i = (1 << decodingLookupBits) - 1; i >= 0; i--) {
u_int32_t currentLow = i << (32 - decodingLookupBits);
while (contexts[context].cumulativeRange[decodedIndex] > currentLow) {
decodedIndex--;
}
contexts[context].decodedIndex[i] = decodedIndex;
}
if (needRescaling) {
for (u_int32_t symbol = 0; symbol < 256; symbol++) {
counts[context][symbol] += 1;
counts[context][symbol] /= 2;
}
puts("rescaling");
}
} while (needRescaling);
}
}

bool useCompressedInput = true;

{
int64_t totalCost = 0;
for (int32_t context = 0; context < 256; context++) {
for (int32_t symbol = 0; symbol < 256; symbol++) {
if (counts[context][symbol] > 0) {
totalCost += counts[context][symbol] *
(32 - contexts[context].logRange[contexts[context].index[symbol]]);
}
}
}
double ratio = (double) totalCost / 8. / length;
printf("Compression ratio: %lf\n", ratio);
if ((length < 1000) || (ratio > 0.9)) {
useCompressedInput = false;
}
}

printf("Process time: %ld\n", clock() - time);

// compute compressed input
if (useCompressedInput) {
time = clock();
for (u_int32_t i = 0; i < length; i++) {
outputBuffer[i] = 0;
}

int32_t outputIndex = 0;
int32_t numFFdwords = 0;
int32_t numTailBits = 0;
u_int32_t tailDword = 0;
u_int32_t currentLow = 0;

int32_t lastByte = sourceBuffer[length - 1];

for (u_int32_t i = 0; i <= length; i++) {
#ifdef USE_CONTEXT_0_NOT_1
lastByte = 0;
#endif
int32_t currentByte = (i == length) ? 0 : sourceBuffer[i];
int32_t currentIndex = (i == length) ? 0 : contexts[lastByte].index[currentByte];

int32_t currentLogRange = (i == length) ? 0 : contexts[lastByte].logRange[currentIndex];
int32_t bitsToShift = (i == length) ? 32 : (32 - currentLogRange);

u_int32_t newLow = currentLow + ((i == length) ? *(u_int32_t*) (sourceBuffer) :
contexts[lastByte].cumulativeRange[currentIndex]);

if (newLow < currentLow) { // overflow
if (numTailBits > 0) {
tailDword += 1 << (32 - numTailBits);
}
if (tailDword == 0) {
*(u_int32_t*) (outputBuffer + outputIndex) = headDword + 1;
outputIndex += 4;
while (numFFdwords > 0) {
numFFdwords--;
*(u_int32_t*) (outputBuffer + outputIndex) = 0;
outputIndex += 4;
}
numTailBits = 0;
} else if (tailDword == 0xffffffff) {
numFFdwords++;
tailDword = 0;
numTailBits = 0;
}
} else {
}
}
}

if (numTailBits + bitsToShift < 32) {
if (bitsToShift > 0) {
tailDword += (newLow >> currentLogRange) << (currentLogRange - numTailBits);
numTailBits += bitsToShift;
}
} else {
int32_t tailBitsToShift = 32 - numTailBits;

tailDword += newLow >> numTailBits;
if (tailDword == 0xffffffff) {
numFFdwords++;
} else {
*(u_int32_t*) (outputBuffer + outputIndex) = headDword;
outputIndex += 4;
while (numFFdwords > 0) {
numFFdwords--;
*(u_int32_t*) (outputBuffer + outputIndex) = 0xffffffff;
outputIndex += 4;
}
}
tailDword = (currentLogRange + tailBitsToShift < 32) ?
(newLow >> currentLogRange) << (currentLogRange + tailBitsToShift) : 0;
numTailBits = bitsToShift - tailBitsToShift;
}
} else {
if (numHeadBits + bitsToShift <= 32) {
} else {

tailDword = (newLow >> currentLogRange) << (currentLogRange + headBitsToShift);

}
}

currentLow = newLow << bitsToShift;
lastByte = currentByte;
}

*(u_int32_t*) (outputBuffer + outputIndex) = headDword;
outputIndex += 4;
while (numFFdwords > 0) {
numFFdwords--;
*(u_int32_t*) (outputBuffer + outputIndex) = 0xffffffff;
outputIndex += 4;
}
if (numTailBits > 0) {
*(u_int32_t*) (outputBuffer + outputIndex) = tailDword;
outputIndex += 4;
}
}

printf("Output index: %d\n", outputIndex);
printf("Process time: %ld\n", clock() - time);
}

if (true) {
// test print context
int32_t testedContext = sourceBuffer[length - 1];
context_t c = contexts[testedContext];
for (int32_t index = 0; index < c.numSymbols; index++) {
printf("%3d %3d %2d %8x\n", index, c.value[index], c.logRange[index],
c.cumulativeRange[index]);
}
puts("");
for (int32_t symbol = 0; symbol < 256; symbol++) {
if (counts[testedContext][symbol] > 0) {
printf("%3d %3d\n", symbol, c.index[symbol]);
}
}
for (int32_t low = 0; low < 1024; low++) {
printf("%3x %3d\n", low << 2, c.decodedIndex[low]);
}
}

// verify compressed input
if (useCompressedInput) {
bool foundBrokenIndex = false;
time = clock();
u_int32_t currentLow = *(u_int32_t*) (outputBuffer + 0);
u_int32_t bufferDword = *(u_int32_t*) (outputBuffer + 4);
int32_t bufferBits = 32;
int32_t outputIndex = 8;

int32_t lastByte = sourceBuffer[length - 1];
for (u_int32_t i = 0; i < length; i++) {
#ifdef USE_CONTEXT_0_NOT_1
lastByte = 0;
#endif
context_t &currentContext = contexts[lastByte];
int32_t currentIndex =
currentContext.decodedIndex[currentLow >> (32 - decodingLookupBits)];
while (((currentIndex + 1) < currentContext.numSymbols) &&
(currentLow >= currentContext.cumulativeRange[currentIndex + 1])) {
currentIndex++;
}
lastByte = currentContext.value[currentIndex];
if ((lastByte != sourceBuffer[i]) && (!foundBrokenIndex)) {
foundBrokenIndex = true;
printf("First broken index: %d\n", i);
}
int32_t bitsToShift = 32 - currentContext.logRange[currentIndex];
if (bitsToShift != 0) {
currentLow -= currentContext.cumulativeRange[currentIndex];
currentLow <<= bitsToShift;
currentLow += bufferDword >> (32 - bitsToShift);
if (bitsToShift < bufferBits) {
bufferDword <<= bitsToShift;
bufferBits -= bitsToShift;
} else if (bitsToShift > bufferBits) {
bufferDword = *(u_int32_t*) (outputBuffer + outputIndex);
outputIndex += 4;
bitsToShift -= bufferBits;
currentLow += bufferDword >> (32 - bitsToShift);
bufferDword <<= bitsToShift;
bufferBits = 32 - bitsToShift;
} else {
bufferDword = *(u_int32_t*) (outputBuffer + outputIndex);
outputIndex += 4;
bitsToShift -= bufferBits;
bufferBits = 32;
}
}
}
printf("Process time: %ld\n", clock() - time);
length = outputIndex;
}

fflush(stdout);

return 0;

for (u_int32_t i = 0; i < length; i++) {
SA[i] = i;
}

for (u_int32_t i = 0; i < length; i++) {
outputBuffer[i] = sourceBuffer[clampOnce(SA[i] + length - 1, length)];
}

{
clock_t time = clock();
for (u_int32_t i = 1; i < length; i++) {
int32_t leftPtr = SA[i - 1];
int32_t middlePtr = SA[i];
int32_t equal = 0;
for (; equal < DepthLimit; equal++) {
if (sourceBuffer[leftPtr] > sourceBuffer[middlePtr]) {
printf("Wrong order on position: %d, equal prefix length: %d\n", i - 1, equal);
break;
} else if (sourceBuffer[leftPtr] < sourceBuffer[middlePtr]) {
break;
}
leftPtr = ((leftPtr + 1) == length) ? 0 : leftPtr + 1;
middlePtr = ((middlePtr + 1) == length) ? 0 : middlePtr + 1;
}
}
printf("Rotation order checking time: %ld\n", clock() - time);
}

FILE *outputFile = fopen("/home/piotrek/Pulpit/bwtoutput.mergerotsort", "wb");

size_t writtenBytes = fwrite(outputBuffer, sizeof (u_int8_t), length, outputFile);
assert(writtenBytes == length);

fclose(outputFile);

return 0;
}```

29. Huff0 only outputs one byte at a time (except when some segments are declared uncompressible, or one-byte repetitive, but this is not a common situation).

The speed comes from the trivial main loop.
And the main loop is trivial because Huff0 is block based, which means statistics are only updated at block boundaries.

I have not read your code in detail, but it seems to be much more dynamic, hence its superior compression capability.
Btw, 68 MBps/ 80 MBps in order-1 mode, I think it is great speed!

30. > I have not read your code in detail, but it seems to be much more dynamic, hence its superior compression capability.

It is not dynamic. In fact it is completely static, ie. codes are computed only once per entire file.
I have optimized my codec for order-1 coding; maybe with optimization for order-0 I could get significantly closer to your coder's performance. But order-0 is not of particular interest to me, as it offers noticeably lower compression, and maybe that loss of efficiency would make it infeasible for block-sorting.
My coder has coding efficiency equal to Huffman codes (except maybe the last renormalizations and flushes, which make the output a few bytes, I think < 8, longer), and the improvement over Shelwien's approach comes from the fact that ordered bitcodes are inefficient compared to Huffman ones.

> Btw, 68 MBps/ 80 MBps in order-1 mode, i think it is great speed !

Glad to see that. I also think it is good.
