
Thread: Variable-length Offset

  1. #1 Cyan (Member, France)

    Variable-length Offset

    Hi

    I've been trying lately to find a simple way to create a dynamic-length offset coding (this could also be useful for match lengths, and even literal counts). Believing this had already been done in many other projects, I dug through Google, with surprisingly few results.
    The only result I could find was a mention of Huffman coding, which surprised me, because Huffman looks like a good way to improve "specific values", but less suitable for complete ranges (or there is something I have not understood).

    To illustrate what I'm looking for, here is an example:

    Code:
    Fixed Size Offset
    Bit.range       1    2    3    4    5    6    7    8    9   10   11   12   13   14
    Signal.bits     0    0    0    0    0    0    0    0    0    0    0    0    0    0
    Offset.bits    14   14   14   14   14   14   14   14   14   14   14   14   14   14
    Total.bits     14   14   14   14   14   14   14   14   14   14   14   14   14   14
    Occurrence     20    5    5    5    6   10   20   30   40   45   45   35   25   15
    Bits x Occ    280   70   70   70   84  140  280  420  560  630  630  490  350  210   = 4284

    Step 4: Multi-range with Huffman header
    Bit.range       1    2    3    4    5    6    7    8    9   10   11   12   13   14
    Signal.bits     3    3    3    2    2    2    2    2    2    1    1    1    3    3
    Offset.bits     3    3    3    9    9    9    9    9    9   12   12   12   14   14
    Total.bits      6    6    6   11   11   11   11   11   11   13   13   13   17   17
    Occurrence     20    5    5    5    6   10   20   30   40   45   45   35   25   15
    Bits x Occ    120   30   30   55   66  110  220  330  440  585  585  455  425  255   = 3706
    Range Occ.     20   25   30    5   11   21   41   71  111   45   90  125   25   40

    The above example can be computed relatively simply. Huffman is used to select the optimal headers (Signal.bits), but not to select the ranges themselves.
    The calculation also has a few inefficiencies. For example, the 1st range uses 3 bits (offsets 1-8), while the 2nd range uses 9 bits but starts at 9 instead of 1. So the second range is [10-521], and value 520, which is counted in the 3rd range in this calculation (12 bits), would actually fall in the second range. This saves a few extra bits, so the final bit count is likely to be slightly better than the one calculated. I doubt this would change the range selection, but it has to be acknowledged that this effect is not taken into account by the algorithm.
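
    For reference, here is a minimal sketch of the cost computation behind the table above. The partition into ranges and the Huffman header lengths (3, 2, 1, 3 bits) are the ones shown in the table; a real optimizer would of course search over partitions and rebuild the Huffman header for each candidate.

    Code:
    /* Cost of the "multi-range with Huffman header" layout from the table:
       each offset pays the Huffman header length of its range plus the
       fixed number of data bits used inside that range. */
    #include <stdio.h>

    static const unsigned occ[14]   = {20, 5, 5, 5, 6, 10, 20, 30, 40, 45, 45, 35, 25, 15};
    static const int range_end[4]   = {3, 9, 12, 14};   /* last bit-length class of each range */
    static const int data_bits[4]   = {3, 9, 12, 14};   /* offset bits used inside the range */
    static const int signal_bits[4] = {3, 2, 1, 3};     /* Huffman header lengths */

    int main(void)
    {
        unsigned long total = 0;
        int r = 0;
        for (int b = 1; b <= 14; b++) {                  /* b = offset bit-length class */
            if (b > range_end[r])
                r++;
            total += (unsigned long)occ[b - 1] * (signal_bits[r] + data_bits[r]);
        }
        printf("total = %lu bits\n", total);             /* prints 3706, as in the table */
        return 0;
    }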

    Once again, this is just "homework", but I'm sure this has already been extensively tested/researched elsewhere, and I just need to turn my head in the right direction.

    Any idea ?
    Last edited by Cyan; 10th December 2008 at 17:15.

  2. #2 Shelwien (Administrator, Kharkov, Ukraine)
    > Believing this had already been done in many other
    > projects, I dug through Google, with surprisingly
    > few results.

    Well, wrong keywords, probably.
    At least, there are quite a number of LZ implementations
    and open-source audio codecs that use something similar.

    E.g. http://en.wikipedia.org/wiki/Golomb_coding
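
    For a concrete taste of that idea, here is a minimal Rice-coding sketch (Golomb coding with a power-of-two divisor M = 2^k). Printing bits as characters is illustrative only; a real coder would pack them into bytes.

    Code:
    #include <stdio.h>

    /* Rice code: quotient in unary, then the k low bits of the remainder. */
    static void put_bit(int b) { putchar(b ? '1' : '0'); }

    static void put_rice(unsigned value, int k)
    {
        unsigned q = value >> k;            /* quotient, sent in unary */
        while (q-- > 0)
            put_bit(1);
        put_bit(0);                         /* 0 terminates the unary part */
        for (int i = k - 1; i >= 0; i--)    /* remainder: k low bits, MSB first */
            put_bit((value >> i) & 1);
    }

    int main(void)
    {
        for (unsigned v = 0; v < 8; v++) {  /* with k = 2: 0 -> 000 ... 7 -> 1011 */
            put_rice(v, 2);
            putchar('\n');
        }
        return 0;
    }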

    > The only result I could find was a mention of Huffman
    > coding, which surprised me, because Huffman looks like
    > a good way to improve "specific values", but less
    > suitable for complete ranges (or there is something I
    > have not understood).

    Huffman coding gives the optimal codes for any alphabet,
    ranges or whatever, but you have to supply the correct
    statistics (numbers of occurrences).

    Also, it's common to use blockwise static Huffman coding,
    with rather small blocks, like 32k.

    Another commonly mentioned optimization is modelling
    an offset/length pair as a single value (it improves
    the coding precision), but somehow it never really
    gets implemented.
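
    One way to read that (an assumption on my part, not how any particular coder does it) is to reduce both values to small "slots" and gather statistics over the fused slot pair, so a single Huffman table covers the pair instead of two independent ones:

    Code:
    #include <stdio.h>

    #define LEN_SLOTS 8
    #define OFF_SLOTS 16

    /* Illustrative log2-style bucket; real coders use hand-tuned slot tables. */
    static int slot_of(unsigned v, int max_slots)
    {
        int s = 0;
        while (v >>= 1)
            s++;
        return s < max_slots ? s : max_slots - 1;
    }

    static unsigned pair_count[LEN_SLOTS * OFF_SLOTS];   /* one entry per joint symbol */

    static int count_match(unsigned length, unsigned offset)
    {
        int sym = slot_of(length, LEN_SLOTS) * OFF_SLOTS + slot_of(offset, OFF_SLOTS);
        pair_count[sym]++;            /* statistics on the pair, not on the parts */
        return sym;
    }

    int main(void)
    {
        printf("joint symbol for length 18, offset 777: %d\n", count_match(18, 777));
        return 0;
    }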

    > Any idea ?

    In fact, I have no idea what your question is.

  3. #3 Cyan (Member, France)
    > Believing this had already been done in many other
    > projects, I dug through Google, with surprisingly
    > few results.

    > Well, wrong keywords, probably.
    Indeed. And I would be glad to learn which keywords would provide good search results.


    Thanks for the link. I will most certainly need some time to digest its content, as it seems beyond my limited neuronal capacity right now. All I could get was the clever idea of separating values into an upper and a lower part. That's barely scratching the surface of the concept.

    > Huffman coding gives the optimal codes for any alphabet,
    > ranges or whatever, but you have to supply the correct
    > statistics (numbers of occurrences).
    Exactly, I fully agree.
    Now if I want to compress a sequence of numbers whose values are between 1 and 16K, would it be practical to treat it as an alphabet of 16K elements and use Huffman on it? I don't think so, but maybe I'm missing something.


    > In fact, I have no idea what your question is.
    OK, OK, there are actually so many possibilities to achieve the same objective that it can be confusing. As a beginner, I'm only really interested in the simplest schemes, even if more powerful ones exist.

    This started with a previous post from Ilia:
    http://encode.su/forum/showthread.php?t=143&page=2

    The idea there is very simple: just try different variable-length offset formats, and select the better one.

    For example:

    Small LZ-code (16-bit):
    1-bit - Small/Large Flag (0)
    3-bit - Match Length
    12-bit - Match Offset

    Large LZ-code (24-bit):
    1-bit - Small/Large Flag (1)
    6-bit - Match Length
    17-bit - Match Offset
    Basically, to sum up, after experimenting with different possible offset formats, he ended up with the unsurprising result that changing the format benefits some files and worsens others.
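
    For concreteness, a minimal sketch of such a two-format scheme; the field widths follow the listing above, and the bit output just prints characters for illustration:

    Code:
    #include <stdio.h>

    static void emit(unsigned value, int nbits)     /* MSB-first, printed as text */
    {
        for (int i = nbits - 1; i >= 0; i--)
            putchar(((value >> i) & 1) ? '1' : '0');
    }

    static void put_match(unsigned length, unsigned offset)
    {
        if (length < (1u << 3) && offset < (1u << 12)) {
            emit(0, 1);               /* flag: small 16-bit code */
            emit(length, 3);
            emit(offset, 12);
        } else {
            emit(1, 1);               /* flag: large 24-bit code */
            emit(length, 6);
            emit(offset, 17);
        }
        putchar('\n');
    }

    int main(void)
    {
        put_match(5, 1000);           /* fits the small form */
        put_match(40, 70000);         /* needs the large form */
        return 0;
    }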

    So I was wondering whether it would be possible to calculate the optimal offset format for the current file, and, while at it, to look for a deeper format than a simple control-bit switch between 2 offset lengths. To concentrate on this exercise, I'm not considering the "byte-aligned" restriction of LZSS here.
    I think I've figured out a way to do it, but once again, I'm very surprised not to have found something equivalent (and most probably better) on the net. As you said, wrong keywords probably.
    That's what I'm looking for, if it exists.
    Last edited by Cyan; 14th December 2008 at 21:43.

  4. #4 encode (The Founder, Moscow, Russia)
    One of the simplest and best-known ideas for coding variable-length offsets and match lengths is Gamma codes.

    Such a coding scheme can easily be implemented in assembler:

    An 8-bit match length can be coded simply as a gamma code:

    0 - 00
    1 - 10
    2 - 1100
    3 - 1110
    4 - 110100
    5 - 110110

    Every second bit is the control bit: 1 means read another bit, 0 means end of the code.

    Match offset can be coded as follows:

    the low 8 bits are just stored - i.e. output as one byte

    the high 8 bits (for a 16-bit offset) are gamma coded
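
    To make the interleaving explicit, here is a small encoder sketch that reproduces the codes listed above: the value's binary digits are written MSB first, each followed by a control bit. Printing bits as characters is illustrative only.

    Code:
    #include <stdio.h>

    static void put_gamma(unsigned value)
    {
        int top = 0;
        while ((value >> (top + 1)) != 0)             /* position of the highest bit */
            top++;
        for (int i = top; i >= 0; i--) {
            putchar(((value >> i) & 1) ? '1' : '0');  /* data bit */
            putchar(i > 0 ? '1' : '0');               /* control bit: more bits follow? */
        }
    }

    int main(void)
    {
        for (unsigned v = 0; v <= 5; v++) {           /* prints 00, 10, 1100, ... 110110 */
            put_gamma(v);
            putchar('\n');
        }
        return 0;
    }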

    Look at fast, open-source LZ77 coders like BriefLZ.

  5. #5 Cyan (Member, France)
    Thanks Ilia.
    Indeed, Gamma codes are exactly the kind of reference I was looking for. Thanks for pointing this out.

    After analysis, Gamma codes seem less efficient than a Huffman-based header (or I'm missing something).

    Mathematically, I guess this is logical, because the stop/continue bit does not give more information than a header bit; but the method restricts the possible prefixes to the following format:
    0
    10
    110
    1110
    etc.

    While a Huffman header would allow more possibilities (hence less redundancy):
    000 001 01 10 11

    For comparison, such a scheme would result in the following count:

    Code:
    Step 3: Gamma Coding
    Bit.range       1    2    3    4    5    6    7    8    9   10   11   12   13   14
    Signal.bits     1    1    1    1    1    1    1    1    1    2    2    2    3    3
    Data.bits       9    9    9    9    9    9    9    9    9   12   12   12   14   14
    Total.bits     10   10   10   10   10   10   10   10   10   14   14   14   17   17
    Occurrence     20    5    5    5    6   10   20   30   40   45   45   35   25   15
    Bits x Occ    200   50   50   50   60  100  200  300  400  630  630  490  425  255   = 3840
    3840 is better than "fixed size" (4284), but brings less savings than the previously proposed "optimized" format (3706). This is because the optimisation process is the same, except that I can only optimise the right side (the upper bit counts).

    Well, that is, unless there is something about Gamma codes that I've not properly understood...
    Last edited by Cyan; 14th December 2008 at 21:43.

  6. #6 encode (The Founder, Moscow, Russia)
    JCALG1/jcalg1_d.asm
    Code:
    getgamma:                       ; getbit returns the next bit in the carry flag
        xor     ecx, ecx
        inc     ecx                 ; ecx = 1 (implicit leading bit)
    getgammaloop:
        call    getbit
        adc     ecx, ecx            ; ecx = ecx*2 + data bit
        call    getbit
        jc      getgammaloop        ; control bit set: read another data bit
        db      0c3h                ; ret, emitted as a raw byte
    i.e.
    Code:
    unsigned value = 1;              /* implicit leading bit, as in the asm */
    do {
      value = 2*value + getbit();    /* data bit */
    } while (getbit());              /* control bit: 1 = read another bit */

  7. #7 Cyan (Member, France)
    Thanks again Ilia

    The programs you linked make for very interesting reading, so I'll take some time to learn from them.

    Regards
    Last edited by Cyan; 14th December 2008 at 21:44.

  8. #8 Bulat Ziganshin (Programmer, Uzbekistan)
    > Now if I want to compress a sequence of numbers whose values are between 1 and 16K, would it be practical to treat it as an alphabet of 16K elements and use Huffman on it? I don't think so, but maybe I'm missing something.

    The standard technique is to split it into ~100 intervals of almost the same probability and encode two things: the interval number (using Huffman) and the position inside the interval.

    http://www.haskell.org/bz/lzh.7z contains docs about the most popular LZH-based formats: zip and cabarc-lzx.
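
    To make the interval idea concrete, here is a small sketch; the one-slot-per-power-of-two table is purely illustrative, zip and cabarc-lzx use finer, hand-tuned tables.

    Code:
    #include <stdio.h>

    /* Split v (assumed >= 1, as in the 1-16K example) into an interval number,
       which would go through the Huffman coder, and a position inside the
       interval, which is sent as raw extra bits. */
    static void split_value(unsigned v, int *slot, unsigned *pos, int *extra_bits)
    {
        int s = 0;
        while ((2u << s) <= v)        /* find s with 2^s <= v < 2^(s+1) */
            s++;
        *slot = s;                    /* interval number: entropy coded */
        *extra_bits = s;              /* interval [2^s, 2^(s+1)) holds 2^s values */
        *pos = v - (1u << s);         /* position inside the interval: raw bits */
    }

    int main(void)
    {
        int slot, bits;
        unsigned pos;
        split_value(520, &slot, &pos, &bits);
        printf("value 520 -> slot %d, %d extra bits, position %u\n", slot, bits, pos);
        return 0;
    }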

  9. #9 Cyan (Member, France)
    Splitting numbers into higher & lower parts, quotient & remainder, interval number & position: well, this starts to make sense. Indeed, several hints are pointing in the same direction.

    Thanks for the advice, Bulat.
