## Variable-length Offset

1. Hi

I've been trying lately to find a simple way to create a dynamic-length offset coding (this could also be useful for match length, and even literal count). Believing that this had already been done by many other contributors, I dug my way through Google, with surprisingly few results.
The only result I could find is a mention of Huffman encoding, which surprised me, because Huffman looks like a good way to improve "specific values", but less suitable for complete ranges (or there is something I've not understood, then).

In order to illustrate what I'm looking for, here is an example:

Code:
```
Fixed Size Offset
Bit.range  	1	2	3	4	5	6	7	8	9	10	11	12	13	14
Signal.bits	0	0	0	0	0	0	0	0	0	0	0	0	0	0
Offset.bits	14	14	14	14	14	14	14	14	14	14	14	14	14	14
Total.bits	14	14	14	14	14	14	14	14	14	14	14	14	14	14
Occurence	20	5	5	5	6	10	20	30	40	45	45	35	25	15
Total.bits	280	70	70	70	84	140	280	420	560	630	630	490	350	210		4284

Step 4	Multi range with Huffman header
Bit.range  	1	2	3	4	5	6	7	8	9	10	11	12	13	14
Signal.bits	3	3	3	2	2	2	2	2	2	1	1	1	3	3
Offset.bits	3	3	3	9	9	9	9	9	9	12	12	12	14	14
Total.bits	6	6	6	11	11	11	11	11	11	13	13	13	17	17
Occurence	20	5	5	5	6	10	20	30	40	45	45	35	25	15
Total.bits	120	30	30	55	66	110	220	330	440	585	585	455	425	255		3706
Range Occ.	20	25	30	5	11	21	41	71	111	45	90	125	25	40
```

The above example can be computed relatively simply. Huffman is used to select the optimal header (Signal.bits), but not to select the ranges themselves.
The calculation also has a few inefficiencies. For example, the 1st range uses 3 bits (values 1-8); the 2nd range uses 9 bits, but starts at 9 instead of 1, so it actually covers [9-520]. A value such as 520, which is counted in the 3rd range (12 bits) in this calculation, would fall into the second range instead. This saves a few extra bits, so the final number of bits is likely to be better than the one calculated. I doubt this would change the range selection, but it has to be acknowledged that this effect is not taken into account by the algorithm.
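As a sanity check on the table above, the total cost can be recomputed directly from the per-column header and offset widths. A minimal C sketch (function and variable names are mine, not from any particular coder):

```c
#include <assert.h>

/* Recompute the "Multi range with Huffman header" total: each value costs
 * its Huffman header (signal bits) plus its range's fixed-width offset
 * field, weighted by how often values of that bit range occur.
 */
static long total_bits(const int *occ, const int *sig, const int *off, int n)
{
    long total = 0;
    for (int i = 0; i < n; i++)
        total += (long)occ[i] * (sig[i] + off[i]);
    return total;
}
```

Feeding it the Occurence, Signal.bits and Offset.bits rows reproduces the 3706-bit total; the fixed-size case is the same loop with 0 signal bits and 14 offset bits everywhere, giving 4284.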

Once again, this is just "homework", but I'm sure this has already been extensively tested/researched elsewhere, and I just need to turn my head in the right direction.

Any ideas?

2. > Believing that this had already been done by many other
> contributors, I dug my way through Google, with
> surprisingly few results.

Well, wrong keywords, probably.
At least, there's quite a number of LZ implementations
and open-source audio codecs which use something similar.

Eg. http://en.wikipedia.org/wiki/Golomb_coding
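A Rice code (the power-of-two special case of Golomb coding) is the simplest instance of the idea: the quotient by M = 2^k goes out in unary, the low k bits go out verbatim. A rough C sketch, building the code as a '0'/'1' string instead of doing real bit output (names are illustrative):

```c
#include <assert.h>

/* Rice code (Golomb code with divisor M = 1 << k): send n / M in unary
 * (that many 1-bits, then a terminating 0), followed by the remainder
 * n % M in k plain bits. Returns the total code length in bits.
 */
static int rice_encode(unsigned n, int k, char *out)
{
    int len = 0;
    for (unsigned q = n >> k; q > 0; q--)
        out[len++] = '1';                           /* unary quotient */
    out[len++] = '0';                               /* unary terminator */
    for (int i = k - 1; i >= 0; i--)
        out[len++] = (char)('0' + ((n >> i) & 1));  /* k remainder bits */
    out[len] = '\0';
    return len;                                     /* n/M + 1 + k bits */
}
```

E.g. with k = 2, the value 9 splits into quotient 2 and remainder 1 and codes as 11001 (5 bits); a larger k trades shorter quotients against more fixed remainder bits, which is exactly the range trade-off discussed above.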

> The only result I could find is a mention of Huffman
> encoding, which surprised me, because Huffman looks like a
> good way to improve "specific values", but less suitable for
> complete ranges (or there is something I've not
> understood, then).

Huffman coding gives the optimal codes for any alphabet,
ranges or whatever, but you have to supply the correct
statistics (numbers of occurrences).

Also it's common to use blockwise static Huffman coding,
with rather small blocks, like 32k.

Another commonly mentioned optimization is modelling
an offset/length pair as a single value (it improves
the coding precision), but somehow it never really
gets implemented.

> Any ideas?

In fact, I have no idea as to what your question is.

3. > Believing that this had already been done by many other
> contributors, I dug my way through Google, with
> surprisingly few results.

> Well, wrong keywords, probably.

Indeed. And I would be glad to learn which keywords would provide good search results.

Thanks for the link. I will most certainly need some time to digest its content, as it seems beyond my limited neuronal capacity right now. All I could get is the clever idea of separating values into an upper and a lower part. That's barely scratching the surface of the concept.

> Huffman coding gives the optimal codes for any alphabet,
> ranges or whatever, but you have to supply the correct
> statistics (numbers of occurrences).

Exactly, I fully agree.
Now if I want to compress a series of numbers whose values are between 1 and 16K, would it be practical to consider it an alphabet of 16K elements and start using Huffman on it? I don't think so, but maybe I'm missing something.

> In fact, I have no idea as to what your question is.

OK, OK. Actually there are so many possibilities to achieve the same objective that it can be confusing. As a beginner, I'm only really interested in the simplest schemes, even if more powerful ones do exist.

This started with a previous post from Ilia:

Here the ideas are very simple: just try different variable-length offset formats, and select the better one.

For example:

Small LZ-code (16-bit):
1-bit - Small/Large Flag (0)
3-bit - Match Length
12-bit - Match Offset

Large LZ-code (24-bit):
1-bit - Small/Large Flag (1)
6-bit - Match Length
17-bit - Match Offset
Basically, to sum up: experimenting with different possible offset formats, he ended up with the unsurprising result that changing the format would benefit some files and worsen others.
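The two formats above are easy to pin down in code. A hypothetical packing (the exact field order within each code is my choice; the original post doesn't specify it):

```c
#include <assert.h>
#include <stdint.h>

/* Small LZ-code (16-bit): flag 0, 3-bit match length, 12-bit offset. */
static uint32_t pack_small(uint32_t len, uint32_t off)
{
    assert(len < (1u << 3) && off < (1u << 12));
    return (0u << 15) | (len << 12) | off;
}

/* Large LZ-code (24-bit): flag 1, 6-bit match length, 17-bit offset. */
static uint32_t pack_large(uint32_t len, uint32_t off)
{
    assert(len < (1u << 6) && off < (1u << 17));
    return (1u << 23) | (len << 17) | off;
}
```

The flag bit sits at the top, so a decoder can peek one bit and know whether to read 15 or 23 more.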

So I was wondering if it would be possible to calculate the optimal offset format for the current file, and, while at it, look for richer formats than a simple one-bit switch between 2 offset lengths. To concentrate on this exercise, I'm not considering the "byte-aligned" restriction of LZSS here.
I think I've figured out a way to do it, but once again, I'm very surprised not to have found something equivalent (and most probably better) on the net. As you said, wrong keywords, probably.
That's what I'm looking for, if it exists.

4. One of the simplest and best-known ideas for variable-length offset and match-length coding is gamma codes.

Such a coding scheme can easily be implemented in assembler:

An 8-bit match length can be coded simply as a gamma code:

0 - 0(0)
1 - 1(0)
2 - 1(1)0(0)
3 - 1(1)1(0)
4 - 1(1)0(1)0(0)
5 - 1(1)0(1)1(0)

The bit in parentheses is the control bit: 1 means read another bit, 0 means end of the code.

Match offset can be coded as follows:

the low 8 bits are just stored - i.e. output one byte

the high 8 bits (for a 16-bit offset) are gamma coded
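The interleaved scheme above (each data bit followed by a continue/stop bit) is easy to put into C. A sketch with a toy bit buffer (all names are mine); the decoder is the same loop as the JCALG1-style one quoted later in the thread, with the implicit leading 1 stripped at the end:

```c
#include <assert.h>
#include <stdint.h>

/* Toy MSB-first bit buffer, just for the example. */
typedef struct { uint8_t buf[128]; int pos; } BitStream;

static void put_bit(BitStream *s, int b)
{
    if (b) s->buf[s->pos >> 3] |= (uint8_t)(0x80 >> (s->pos & 7));
    s->pos++;
}

static int get_bit(BitStream *s)
{
    int b = (s->buf[s->pos >> 3] >> (7 - (s->pos & 7))) & 1;
    s->pos++;
    return b;
}

/* Emit the binary digits of n (n = 0 emits a single 0 bit), each data bit
 * followed by a control bit: 1 = another pair follows, 0 = end of code.
 * This reproduces the table above: 0 -> 00, 1 -> 10, 2 -> 1100, ...
 */
static void gamma_put(BitStream *s, unsigned n)
{
    int top = 0;
    while ((n >> top) > 1) top++;      /* index of n's highest set bit */
    for (int i = top; i >= 0; i--) {
        put_bit(s, (n >> i) & 1);      /* data bit */
        put_bit(s, i != 0);            /* control bit */
    }
}

/* Inverse: accumulate data bits behind an implicit leading 1 while the
 * control bit says "continue", then strip that leading 1 again.
 */
static unsigned gamma_get(BitStream *s)
{
    unsigned v = 1;
    do
        v = (v << 1) | (unsigned)get_bit(s);
    while (get_bit(s));
    unsigned top = v;
    while (top & (top - 1)) top &= top - 1;  /* isolate highest set bit */
    return v - top;
}
```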

Look for fast and OpenSource LZ77 coders like BriefLZ.

5. Thanks Ilia.
Indeed, gamma codes are exactly the kind of reference I was looking for. Thanks for pointing this out.

After analysis, gamma coding seems less efficient than the Huffman-based header (or I'm missing something).

Mathematically, I guess this is logical, because the stop/continue bit gives no more information than a header bit; but the method restricts the possibilities to the following formats:
0
10
110
1110
etc.

While a Huffman header would allow more possibilities (hence less redundancy):
000 001 01 10 11
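Both families of prefixes can be checked for completeness with the Kraft sum (sum of 2^-length over all codes equals 1 for a complete prefix code). A small integer-only sketch:

```c
#include <assert.h>

/* Kraft sum of a set of prefix-code lengths, scaled by 64 to stay in
 * integers: a complete prefix code (no wasted code space) sums to 64.
 * Assumes every length is at most 6 bits.
 */
static int kraft64(const int *lens, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += 64 >> lens[i];
    return sum;
}
```

Both the header set {000, 001, 01, 10, 11} (lengths 3,3,2,2,2) and the table's range headers (lengths 3,2,1,3) sum to exactly 1, so neither wastes code space; the difference is only in how well the lengths match the range probabilities.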

To provide a comparison, such a scheme would result in the following count:

Code:
```
Step 3	Gamma Coding
Bit.range  	1	2	3	4	5	6	7	8	9	10	11	12	13	14
Signal.bits	1	1	1	1	1	1	1	1	1	2	2	2	3	3
Data.bits	9	9	9	9	9	9	9	9	9	12	12	12	14	14
Total.bits	10	10	10	10	10	10	10	10	10	14	14	14	17	17
Occurence	20	5	5	5	6	10	20	30	40	45	45	35	25	15
Total.bits	200	50	50	50	60	100	200	300	400	630	630	490	425	255		3840
```
3840 is better than "fixed size" (4284), but saves less than the previously proposed "optimized" format (3706). This is because the optimization process is the same, except that I can only optimize the right side (the upper number of bits).

Well, that is, unless there is something about gamma codes that I've not properly understood...

6. JCALG1/jcalg1_d.asm
Code:
```
getgamma:
	xor	ecx,ecx
	inc	ecx
getgammaloop:
	call	getbit
	adc	ecx,ecx
	call	getbit
	jc	getgammaloop
	db	0c3h	; ret
```
i.e.
Code:
```
value = 1;
do {
	value += value + getbit();
} while (getbit());
```

7. Thanks again Ilia

The programs you linked make for very interesting reading, so I'll take some time to learn from them.

Regards

> Now if I want to compress a series of numbers whose values are between 1 and 16K, would it be practical to consider it an alphabet of 16K elements and start using Huffman on it? I don't think so, but maybe I'm missing something.

The standard technique is to split it into ~100 intervals of roughly equal probability and encode two things - the interval number (using Huffman) and the position inside the interval.
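As a sketch of that split (not the actual zip/LZX tables - the real formats hand-tune the interval boundaries): the interval number can simply be the value's bit length, which gets Huffman coded, while the position inside the interval goes out as plain extra bits.

```c
#include <assert.h>

/* A value's interval number, here just its bit length; real formats use
 * hand-tuned boundaries instead.
 */
static int interval_of(unsigned v)
{
    int bits = 0;
    while (v >> bits) bits++;
    return bits;
}

/* Plain extra bits for the position inside interval i: the interval
 * [2^(i-1), 2^i - 1] holds 2^(i-1) values.
 */
static int extra_bits(int i)
{
    return i > 1 ? i - 1 : 0;
}
```

E.g. the value 520 lands in interval 10 and needs 9 extra bits for its position - only the 10-or-so interval symbols ever reach the Huffman coder, instead of a 16K-entry alphabet.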

http://www.haskell.org/bz/lzh.7z contains docs about most popular lzh-based formats - zip and cabarc-lzx

9. Splitting numbers into higher & lower parts, quotient & remainder, interval number & position - well, this starts to make sense. Indeed, several hints are pointing in the same direction.