1. Okay, today I wrote a simple LZ77 based compressor. There is no super results to share, however, this compressor works!

For decompression we need only buffer size + memory needed for literal/offset models = ~1 MB total memory usage. Two problems and differences from ROLZ:
1. We must drop the short matches at long distances
2. How to encode match position

Current solutions:
1. I make some threshold for a few matches:
if (maxlen == 3 && pos >= (1 << 12))
maxlen = 0;
if (maxlen == 4 && pos >= (1 << 14))
maxlen = 0;
if (maxlen == 5 && pos >= (1 << 16))
maxlen = 0;
if (maxlen == 6 && pos >= (1 << 1 )
maxlen = 0;
Or something like that...

Note that I use 1 MB buffer. And in this experimental version I encode each buffer separetely.

2. For literals I use a bit-oriented arithmetic encoder.

Literals/Match lengths as with LZH represents one alphabet:
0...255 - literals
256...511 - match lengths from 3 to 258

So, how to encode position? Note that we have (1 << 20) or 1 MB of values. With this version I use the 20-bit alphabet - i.e. each bit of match position encoded separetely (as with symbols). I think this is a bad approach, however the idea is interesting...  2. Again, I can add some tricks from RAR:
if (Distance>=0x2000)
{
Length++;
if (Distance>=0x40000L)
Length++;
}  3. if (Distance>=0x101)
{
Length++;
if (Distance>=0x2000)
{
Length++;
if (Distance>=0x40000)
Length++;
}
} 4. Originally Posted by encode
So, how to encode position? Note that we have (1 << 20) or 1 MB of values.
This is very easy. Just use some kind of bin/exp coding. Should look something like this:

void code(int c)
{
if (c<20)
put(c+20);
else
{
int b;
c-=20;
for (b=1;c>=(1<<b); )
c-=1<<b++;
put(b-1);
store_bits(c,b)
}
}

Note, that you can use different bins sizes. And you can speed the bin searching up (binary search). But as always, this only makes sense if there are enough bins. Put(..) uses a 40 symbol alphabet. 5. I know about trees, for example, deflate uses something like this. I use a binary arithmetic encoder. What do you think - is it worth to use some kind of modelling for match positions - i.e. keep counters etc. or just use binary tree?  6. encoder.Encode(1, (PSCALE / 2)); // store bit 1 7. AFAIK, original ROLZ uses a binary three for match indexes (with ROLZ2/ROLZ3 we have up to 64000 indexes).

Okay, do you know any efficient algorithm for such binary code generation, say for 1, 2 or 4 MB positions?

encoder.Encode(1, (PSCALE / 2)); // store bit 1
encoder.Encode(0, (PSCALE / 2)); // store bit 0  8. Originally Posted by Malcolm Taylor
...
These offsets can be encoded in much the same way as with most lz77,
with an exponential type encoding - encode a bucket number, and then
encode the position in the bucket. Bucket sizes increase for each bucket
number (not necessarily linearly). Check out CAB offset encoding,
deflate and 7zip which all use variations on this. 9. What Malcom writes is about the same. The Put(..) in my example codes the bin/bucket-number. store_bits(..) codes the position in the Bucket. Just look at the example again.  Originally Posted by encode
I use a binary arithmetic encoder. What do you think
Just make the number of bins (40 in the example) a power of two and use the binary arithmetic coder on the bin-number alphabet. 10. Thank you! Just tested!

However, something wrong here - my 20-bit encoding works better than this bin coding... I think I must make more experiments...  11. encodiing distances:

classical reading in this area is APPNOTE.TXT, method 8 (and 6). russian translation: zip.doc in http://www.haskell.org/bz/lzh.7z . then you can read cabarc format explanation LZXFMT.DOC, it's again very helpful

now you can start groking sources, such as trees.c from zip or huf.c from ar002 (it was first compressor invited blockwise static huffman)

fortunately, you don't need to reimplement this all yourself - it's better to reuse wheel instead of reinventing it just look how these procedures are used in calling modules and you are here. in particular, you may get modified trees.c from my arjz. cabarc-compatible routines are also available at special request alternatively, you can got these routines from 7zip sources - ari, huffman, anything you may imagine to be exact, i say about whole procedures implementing block-huffman encoding. if you need only good encoding for distances, look here (code from arjz):

#define D_CODES (REPDIST_CODES+60)
/* number of distance codes */

local int near extra_dbits[D_CODES] /* extra bits for each distance code */
= {0,0,0,0, // REP_DIST codes
0,0,0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,
9,9,10,10,11,11,12,12,13,13,14,14,
15,15,16,16,17,17,18,18,19,19,20,20,21,21,22,22,23 ,23,24,24,
25,25,26,26,27,27,28,28};

#define DIST_CODE_SIZE (1024+4096)

local uch dist_code[DIST_CODE_SIZE];
/* distance codes. The first 512 values correspond to the distances
* 3 .. 514, the last 256 values correspond to the top 8 bits of
* the 16 bit distances.
*/

local uint near base_dist[D_CODES];
/* First normalized distance for each code (0 = distance of 1) */

#define d_code(dist)
((dist) < 512 ? dist_code[dist] :
((dist) < 512*256 ? dist_code[512+((dist)>> ]
: dist_code[1024+((dist)>>16)]))
/* Mapping from a distance to a distance code. dist is the distance - 1 and
* must not have side effects. dist_code, , ,  are never
* used.
*/

This code initializes distance mapping arrays

/* Initialize the mapping dist (0..64K) -> dist code (4..63) */
dist = 0;
for (code = REPDIST_CODES; code < REPDIST_CODES+16+2; code++) {
base_dist[code] = dist;
for (n = 0; n < (1<<extra_dbits[code]); n++) {
dist_code[dist++] = (uch)code;
}
}
Assert (dist == 512, "ct_init: dist != 512");
dist >>= 8; /* from now on, all distances are divided by 256 */
for ( ; code < REPDIST_CODES+32+2; code++) {
base_dist[code] = dist << 8;
for (n = 0; n < (1<<(extra_dbits[code]- ); n++) {
dist_code[512 + dist++] = (uch)code;
}
}
Assert (dist == 512, "ct_init: 512+dist != 1024");
dist >>= 8; /* from now on, all distances are divided by 65536 */
for ( ; code < REPDIST_CODES+28*2; code++) { // distances up to 256M
base_dist[code] = dist << 16;
for (n = 0; n < (1<<(extra_dbits[code]-16)); n++) {
dist_code[1024 + dist++] = (uch)code;
}
}
Assert (dist == 4096, "ct_init: 1024+dist != 5120");

This code encodes distance:

/* Here, dist is the match distance - 1 */
code = d_code(dist);
Assert (code < D_CODES, "bad d_code");
send_code(code, dtree); /* send the distance code */

extra = extra_dbits[code];
if (extra != 0) {
dist -= base_dist[code];
send_bits(dist, extra); /* send the extra distance bits */
} 12. Originally Posted by huf.c
static void encode_p(uint p)
{
uint c, q;

c = 0; q = p; while (q) { q >>= 1; c++; }
putbits(pt_len[c], pt_code[c]);
if (c > 1) putbits(c - 1, p & (0xFFFFU >> (17 - c)));
}
By the way, Im greatest fan of Haruhiko Okumura! One question about arjz implementation - how to simplify things?

From ar (huf.c) we can see we need only two tables. These tables filled somethere.

So even, as I understood, no reason to encode BIN with arithmetic encoder - we encode distance entirely with blockwise static huffman.

2 Bulat:
Dou you know the simplest/smallest code to encode distances? Unfortunately, Im not so familar with Huffman encoding - the idea is very simple, and I, of course, understand the principles, however, I havent my own Huffman coder...  13. Fortunately, the LZ77 is well known area. So I think, the distance coding from best LZ77 coders like cabarc or LZMA will be the best for my shot.

Also, we must drop short matches at long distances. Probably even at matching.

Bulat, do you know the efficient formula to:
1. Checking during matching - say we have match length of 3 at 512. After that we find the match length of 4 at, say, 65536. Of course we must choose the first match.
2. Dropping short matches at large distances. Say, after search we have found match of length 5 at 512 MB distance - we must drop it...

And again, well known schemes must work with my LZ.  14. as you see in unrar sources, len=2 dropped with dist>256, 3 - 4096 and 4 - 256k (and their codes are reused for larger strings). for first question, the heurestic you can find in arjz sources is the:

if (new_match_len==old_match_len+1 && new_dist>old_dist*64)
drop_new_match

d_code macro replaces this slow cycle:
c = 0; q = p; while (q) { q >>= 1; c++; }

extra_dbits array allows to have flexible distances mapping while in ar002 number of extra bits is always c-1 15. after all, idea of logarithmic representation is very simple: you map distance into some small enough code and calculate number of lower bits you should sent non-encoded

in ar002, code is just logb of value and additionally all bits except for highest one should be sent non-encoded

in zip, there are two codes per each power of two (read zip's description!). this code does the same encoding without using tables:

#include <stdio.h>

int main(void)
{
for ( int i=0; i<2000; i++)
{
int t=i,n;
for(n=0; t>3; n++,t/=2);
printf ("%4d %4d %4d
", i, n*2+t, n);
}
return 0;
}

it prints encoded value (1-2000), its code and amount of additional bits 16. Thank you Bulat! lzari.zip (20 KB)

A fun art from me! Super draft version with poor compression and speed... However, it's works.

Also, I think I should look into the BWT - I have an idea to make something like a fast-BBB. A BWT with strong output coding - like bit-oriented arithmetic or order-0 PAQ-inspired encoder...  17. Results on my testset:
all 15.812.052, compression 908kB/s, decompression 4000kB/s
exe 747.792
dll 1.641.758 18. Thanks BF! Note that LZARI has no E8/E9 transformer and uses Greedy Parsing... 19. Arithmetic gives higer compression compared to the huffman. Indeed, I tried to implement this static huffman for distances and looks like my approach works better - i.e. coding the 20-bit value via bit-oriented arith. Probably if I initialize counters properly, this will give another boost.

But one thing I cannot understand, why performance is so poor? (is equal to Deflate).

Probably the key is in inefficient LZ-output coding. Or I implemented wrong static blockwise huffman (I tried compresison only).  20. I believe that my match finder finds all matches. Plus tricky sliding window is also works...  21. Originally Posted by encode
A fun art from me! Super draft version with poor compression and speed... However, its works
Thanks Ilia!  22. Originally Posted by encode
But one thing I cannot understand, why performance is so poor? (is equal to Deflate).
i should see?  23. I expect that this scheme will give higher compression...

However, playing with the baseline LZ77 I get some ideas for ROLZ compression...  24. if you thought that you can easily get lzma-like results - sorry, it uses a lot of tricks. seems that you still dont read even zip docs? Originally Posted by encode
However, playing with the baseline LZ77 I get some ideas for ROLZ compression...
i also thinks that there are still enough room for improvement. how about some combined scheme, that uses lz77 for small near strings? or rk-like scheme where search performed in large buffer of strings with the same order-1 context? 25. 26. Originally Posted by Bulat Ziganshin
seems that you still dont read even zip docs? Originally Posted by Black_Fox
Be sure, I read ALL docs.

Read something and understand is one thing, write own program is different one. Thats why I just ask people like Bulat about pure LZ77 and LZH. Currently, Im writing a new experimental compressor. With one thing in mind - get a higher binary compression with simple decoder. I hope the first public version will be available soon. I will keep name of the compressor in secret, but I can say I got some interesting results, even compared to the QUAD.  27. A bit offtopic maybe - how difficult is it to make a run-time packer? Will you try it when you get LZARI to compress better?  28. Originally Posted by Black_Fox
A bit offtopic maybe - how difficult is it to make a run-time packer?
You mean EXE-packer? Or super-fast compressor? Im writing a new engine from scratch!  29. I mean exe/dll packer  30. Originally Posted by Black_Fox
I mean exe/dll packer
No, I havent such plans. The main reason is the new engine stays in very early stage of development - its a new big trip...  lz, lz77, lzari 