Page 1 of 2 12 LastLast
Results 1 to 30 of 35

Thread: LZARI

  1. #1
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    Okay, today I wrote a simple LZ77 based compressor. There is no super results to share, however, this compressor works!

    For decompression we need only buffer size + memory needed for literal/offset models = ~1 MB total memory usage.

    Two problems and differences from ROLZ:
    1. We must drop the short matches at long distances
    2. How to encode match position

    Current solutions:
    1. I make some threshold for a few matches:
    if (maxlen == 3 && pos >= (1 << 12))
    maxlen = 0;
    if (maxlen == 4 && pos >= (1 << 14))
    maxlen = 0;
    if (maxlen == 5 && pos >= (1 << 16))
    maxlen = 0;
    if (maxlen == 6 && pos >= (1 << 1)
    maxlen = 0;
    Or something like that...

    Note that I use 1 MB buffer. And in this experimental version I encode each buffer separetely.

    2. For literals I use a bit-oriented arithmetic encoder.

    Literals/Match lengths as with LZH represents one alphabet:
    0...255 - literals
    256...511 - match lengths from 3 to 258

    So, how to encode position? Note that we have (1 << 20) or 1 MB of values. With this version I use the 20-bit alphabet - i.e. each bit of match position encoded separetely (as with symbols). I think this is a bad approach, however the idea is interesting...

  2. #2
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    Again, I can add some tricks from RAR:
    if (Distance>=0x2000)
    {
    Length++;
    if (Distance>=0x40000L)
    Length++;
    }

  3. #3
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    if (Distance>=0x101)
    {
    Length++;
    if (Distance>=0x2000)
    {
    Length++;
    if (Distance>=0x40000)
    Length++;
    }
    }

  4. #4
    Programmer
    Join Date
    Feb 2007
    Location
    Germany
    Posts
    420
    Thanks
    28
    Thanked 153 Times in 18 Posts
    Quote Originally Posted by encode
    So, how to encode position? Note that we have (1 << 20) or 1 MB of values.
    This is very easy. Just use some kind of bin/exp coding. Should look something like this:

    void code(int c)
    {
    if (c<20)
    put(c+20);
    else
    {
    int b;
    c-=20;
    for (b=1;c>=(1<<b); )
    c-=1<<b++;
    put(b-1);
    store_bits(c,b)
    }
    }

    Note, that you can use different bins sizes. And you can speed the bin searching up (binary search). But as always, this only makes sense if there are enough bins. Put(..) uses a 40 symbol alphabet.

  5. #5
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    I know about trees, for example, deflate uses something like this. I use a binary arithmetic encoder. What do you think - is it worth to use some kind of modelling for match positions - i.e. keep counters etc. or just use binary tree?

    header->footer


  6. #6
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    encoder.Encode(1, (PSCALE / 2)); // store bit 1

  7. #7
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    AFAIK, original ROLZ uses a binary three for match indexes (with ROLZ2/ROLZ3 we have up to 64000 indexes).

    Okay, do you know any efficient algorithm for such binary code generation, say for 1, 2 or 4 MB positions?

    encoder.Encode(1, (PSCALE / 2)); // store bit 1
    encoder.Encode(0, (PSCALE / 2)); // store bit 0


  8. #8
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    Quote Originally Posted by Malcolm Taylor
    ...
    These offsets can be encoded in much the same way as with most lz77,
    with an exponential type encoding - encode a bucket number, and then
    encode the position in the bucket. Bucket sizes increase for each bucket
    number (not necessarily linearly). Check out CAB offset encoding,
    deflate and 7zip which all use variations on this.

  9. #9
    Programmer
    Join Date
    Feb 2007
    Location
    Germany
    Posts
    420
    Thanks
    28
    Thanked 153 Times in 18 Posts
    What Malcom writes is about the same. The Put(..) in my example codes the bin/bucket-number. store_bits(..) codes the position in the Bucket. Just look at the example again.

    Quote Originally Posted by encode
    I use a binary arithmetic encoder. What do you think
    Just make the number of bins (40 in the example) a power of two and use the binary arithmetic coder on the bin-number alphabet.

  10. #10
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    Thank you! Just tested!

    However, something wrong here - my 20-bit encoding works better than this bin coding... I think I must make more experiments...

  11. #11
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,507
    Thanks
    742
    Thanked 665 Times in 359 Posts
    encodiing distances:

    classical reading in this area is APPNOTE.TXT, method 8 (and 6). russian translation: zip.doc in http://www.haskell.org/bz/lzh.7z . then you can read cabarc format explanation LZXFMT.DOC, it's again very helpful

    now you can start groking sources, such as trees.c from zip or huf.c from ar002 (it was first compressor invited blockwise static huffman)

    fortunately, you don't need to reimplement this all yourself - it's better to reuse wheel instead of reinventing it just look how these procedures are used in calling modules and you are here. in particular, you may get modified trees.c from my arjz. cabarc-compatible routines are also available at special request

    alternatively, you can got these routines from 7zip sources - ari, huffman, anything you may imagine

    to be exact, i say about whole procedures implementing block-huffman encoding. if you need only good encoding for distances, look here (code from arjz):

    #define D_CODES (REPDIST_CODES+60)
    /* number of distance codes */

    local int near extra_dbits[D_CODES] /* extra bits for each distance code */
    = {0,0,0,0, // REP_DIST codes
    0,0,0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,
    9,9,10,10,11,11,12,12,13,13,14,14,
    15,15,16,16,17,17,18,18,19,19,20,20,21,21,22,22,23 ,23,24,24,
    25,25,26,26,27,27,28,28};

    #define DIST_CODE_SIZE (1024+4096)

    local uch dist_code[DIST_CODE_SIZE];
    /* distance codes. The first 512 values correspond to the distances
    * 3 .. 514, the last 256 values correspond to the top 8 bits of
    * the 16 bit distances.
    */

    local uint near base_dist[D_CODES];
    /* First normalized distance for each code (0 = distance of 1) */

    #define d_code(dist)
    ((dist) < 512 ? dist_code[dist] :
    ((dist) < 512*256 ? dist_code[512+((dist)>>]
    : dist_code[1024+((dist)>>16)]))
    /* Mapping from a distance to a distance code. dist is the distance - 1 and
    * must not have side effects. dist_code[512], [513], [1024], [1025] are never
    * used.
    */


    This code initializes distance mapping arrays

    /* Initialize the mapping dist (0..64K) -> dist code (4..63) */
    dist = 0;
    for (code = REPDIST_CODES; code < REPDIST_CODES+16+2; code++) {
    base_dist[code] = dist;
    for (n = 0; n < (1<<extra_dbits[code]); n++) {
    dist_code[dist++] = (uch)code;
    }
    }
    Assert (dist == 512, "ct_init: dist != 512");
    dist >>= 8; /* from now on, all distances are divided by 256 */
    for ( ; code < REPDIST_CODES+32+2; code++) {
    base_dist[code] = dist << 8;
    for (n = 0; n < (1<<(extra_dbits[code]-); n++) {
    dist_code[512 + dist++] = (uch)code;
    }
    }
    Assert (dist == 512, "ct_init: 512+dist != 1024");
    dist >>= 8; /* from now on, all distances are divided by 65536 */
    for ( ; code < REPDIST_CODES+28*2; code++) { // distances up to 256M
    base_dist[code] = dist << 16;
    for (n = 0; n < (1<<(extra_dbits[code]-16)); n++) {
    dist_code[1024 + dist++] = (uch)code;
    }
    }
    Assert (dist == 4096, "ct_init: 1024+dist != 5120");


    This code encodes distance:

    /* Here, dist is the match distance - 1 */
    code = d_code(dist);
    Assert (code < D_CODES, "bad d_code");
    send_code(code, dtree); /* send the distance code */

    extra = extra_dbits[code];
    if (extra != 0) {
    dist -= base_dist[code];
    send_bits(dist, extra); /* send the extra distance bits */
    }

  12. #12
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    Quote Originally Posted by huf.c
    static void encode_p(uint p)
    {
    uint c, q;

    c = 0; q = p; while (q) { q >>= 1; c++; }
    putbits(pt_len[c], pt_code[c]);
    if (c > 1) putbits(c - 1, p & (0xFFFFU >> (17 - c)));
    }
    By the way, Im greatest fan of Haruhiko Okumura!

    One question about arjz implementation - how to simplify things?

    From ar (huf.c) we can see we need only two tables. These tables filled somethere.

    So even, as I understood, no reason to encode BIN with arithmetic encoder - we encode distance entirely with blockwise static huffman.

    2 Bulat:
    Dou you know the simplest/smallest code to encode distances? Unfortunately, Im not so familar with Huffman encoding - the idea is very simple, and I, of course, understand the principles, however, I havent my own Huffman coder...

  13. #13
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    Fortunately, the LZ77 is well known area.

    So I think, the distance coding from best LZ77 coders like cabarc or LZMA will be the best for my shot.

    Also, we must drop short matches at long distances. Probably even at matching.

    Bulat, do you know the efficient formula to:
    1. Checking during matching - say we have match length of 3 at 512. After that we find the match length of 4 at, say, 65536. Of course we must choose the first match.
    2. Dropping short matches at large distances. Say, after search we have found match of length 5 at 512 MB distance - we must drop it...

    And again, well known schemes must work with my LZ.

  14. #14
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,507
    Thanks
    742
    Thanked 665 Times in 359 Posts
    as you see in unrar sources, len=2 dropped with dist>256, 3 - 4096 and 4 - 256k (and their codes are reused for larger strings). for first question, the heurestic you can find in arjz sources is the:

    if (new_match_len==old_match_len+1 && new_dist>old_dist*64)
    drop_new_match



    d_code macro replaces this slow cycle:
    c = 0; q = p; while (q) { q >>= 1; c++; }

    extra_dbits array allows to have flexible distances mapping while in ar002 number of extra bits is always c-1

  15. #15
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,507
    Thanks
    742
    Thanked 665 Times in 359 Posts
    after all, idea of logarithmic representation is very simple: you map distance into some small enough code and calculate number of lower bits you should sent non-encoded

    in ar002, code is just logb of value and additionally all bits except for highest one should be sent non-encoded

    in zip, there are two codes per each power of two (read zip's description!). this code does the same encoding without using tables:

    #include <stdio.h>

    int main(void)
    {
    for ( int i=0; i<2000; i++)
    {
    int t=i,n;
    for(n=0; t>3; n++,t/=2);
    printf ("%4d %4d %4d
    ", i, n*2+t, n);
    }
    return 0;
    }

    it prints encoded value (1-2000), its code and amount of additional bits

  16. #16
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    Thank you Bulat!

    Link:
    lzari.zip (20 KB)

    A fun art from me! Super draft version with poor compression and speed... However, it's works.

    Also, I think I should look into the BWT - I have an idea to make something like a fast-BBB. A BWT with strong output coding - like bit-oriented arithmetic or order-0 PAQ-inspired encoder...


  17. #17
    Member
    Join Date
    Dec 2006
    Posts
    611
    Thanks
    0
    Thanked 1 Time in 1 Post
    Results on my testset:
    all 15.812.052, compression 908kB/s, decompression 4000kB/s
    exe 747.792
    dll 1.641.758

  18. #18
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    Thanks BF!

    Note that LZARI has no E8/E9 transformer and uses Greedy Parsing...

  19. #19
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    Arithmetic gives higer compression compared to the huffman. Indeed, I tried to implement this static huffman for distances and looks like my approach works better - i.e. coding the 20-bit value via bit-oriented arith. Probably if I initialize counters properly, this will give another boost.

    But one thing I cannot understand, why performance is so poor? (is equal to Deflate).

    Probably the key is in inefficient LZ-output coding. Or I implemented wrong static blockwise huffman (I tried compresison only).


  20. #20
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    I believe that my match finder finds all matches. Plus tricky sliding window is also works...

  21. #21
    Moderator

    Join Date
    May 2008
    Location
    Tristan da Cunha
    Posts
    2,034
    Thanks
    0
    Thanked 4 Times in 4 Posts
    Quote Originally Posted by encode
    A fun art from me! Super draft version with poor compression and speed... However, its works
    Thanks Ilia!

  22. #22
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,507
    Thanks
    742
    Thanked 665 Times in 359 Posts
    Quote Originally Posted by encode
    But one thing I cannot understand, why performance is so poor? (is equal to Deflate).
    i should see?

  23. #23
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    I expect that this scheme will give higher compression...

    However, playing with the baseline LZ77 I get some ideas for ROLZ compression...

  24. #24
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,507
    Thanks
    742
    Thanked 665 Times in 359 Posts
    if you thought that you can easily get lzma-like results - sorry, it uses a lot of tricks. seems that you still dont read even zip docs?

    Quote Originally Posted by encode
    However, playing with the baseline LZ77 I get some ideas for ROLZ compression...
    i also thinks that there are still enough room for improvement. how about some combined scheme, that uses lz77 for small near strings? or rk-like scheme where search performed in large buffer of strings with the same order-1 context?

  25. #25
    Member
    Join Date
    Dec 2006
    Posts
    611
    Thanks
    0
    Thanked 1 Time in 1 Post

  26. #26
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    Quote Originally Posted by Bulat Ziganshin
    seems that you still dont read even zip docs?
    Quote Originally Posted by Black_Fox
    Be sure, I read ALL docs.

    Read something and understand is one thing, write own program is different one. Thats why I just ask people like Bulat about pure LZ77 and LZH.

    Currently, Im writing a new experimental compressor. With one thing in mind - get a higher binary compression with simple decoder. I hope the first public version will be available soon. I will keep name of the compressor in secret, but I can say I got some interesting results, even compared to the QUAD.

  27. #27
    Member
    Join Date
    Dec 2006
    Posts
    611
    Thanks
    0
    Thanked 1 Time in 1 Post
    A bit offtopic maybe - how difficult is it to make a run-time packer? Will you try it when you get LZARI to compress better?

  28. #28
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    Quote Originally Posted by Black_Fox
    A bit offtopic maybe - how difficult is it to make a run-time packer?
    You mean EXE-packer? Or super-fast compressor? Im writing a new engine from scratch!

  29. #29
    Member
    Join Date
    Dec 2006
    Posts
    611
    Thanks
    0
    Thanked 1 Time in 1 Post
    I mean exe/dll packer

  30. #30
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    Quote Originally Posted by Black_Fox
    I mean exe/dll packer
    No, I havent such plans. The main reason is the new engine stays in very early stage of development - its a new big trip...

Page 1 of 2 12 LastLast

Similar Threads

  1. Tornado - fast lzari compressor
    By Bulat Ziganshin in forum Forum Archive
    Replies: 23
    Last Post: 27th July 2007, 14:26

Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •