# Thread: Grouped (ROLZ) LZW compressor

1. ## Grouped (ROLZ) LZW compressor

Hi all, this is the first time I have written to this list. I would like to present Glue, a compressor I wrote:
it combines the LZW method with a prediction on a sorted dictionary, which in LZSS terms is called ROLZ.
The Pascal source code can be found at
http://digilander.libero.it/borsa77
under the link Lempel-Ziv arc. The executable is compiled with FPC for DOS, and the same archive contains
the old work of Okumura and Yoshizaki.
As stated by Ross Williams in the explanatory paper for his LZRW4, the dictionary is split into 256 groups
keyed by the word's initial character, where each group has a size of 255 items; this is aimed at reducing
pointer size, and the actual group prediction is attempted at orders 1 and 0.
At order 1, a recency rank list is maintained for every group and the rank is coded with Elias gamma. The
list is quite short; its length can be set through the constant Span, and single characters are never inserted
at this order. Two escape codes can occur, depending on whether the symbol is a string or a single character;
in either case order 0 takes care of coding it with dynamic Huffman, omitting the pointer when it is a character.
At order 1 the probabilities of the escape codes are adjusted with the move-to-front method, while a one-step
shift is used for the symbols. A nice feature of the LZW dictionary is that, since it is managed as a trie,
every item holds a successor list which can be used as an exclusion list for the prediction (Horspool).
As shown in the following example

Dict: ab,abc,de
Input: ab|de
Order 1 rank for b: (0)c,(1)d

the code for d can be 0 instead of 1. Note that this approach works only with greedy parsing.
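As a toy illustration of this exclusion trick (a hypothetical Python sketch; the helper name and data shapes are mine, not from Glue's Pascal source):

```python
def rank_with_exclusion(rank_list, successors, target):
    # Walk the recency rank list, skipping entries that are successors
    # of the just-coded word in the LZW trie: under greedy parsing they
    # cannot be the next symbol, so they consume no code space.
    idx = 0
    for sym in rank_list:
        if sym in successors:
            continue
        if sym == target:
            return idx
        idx += 1
    return None  # target absent: an escape code would be emitted

# Dict: ab, abc, de; input ab|de; rank list for context b is [c, d].
# c is a successor of "ab" (abc exists), so d codes as 0 instead of 1.
```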
The frequencies of Vitter's Huffman algorithm are never scaled down; this global modelling, juxtaposed with
the local one imposed by the Span constant, may help handle a wide spectrum of possible inputs.
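For reference, the Elias gamma code used for the order-1 ranks is easy to sketch (a minimal Python illustration, assuming ranks are mapped to integers >= 1):

```python
def elias_gamma(n: int) -> str:
    # Gamma code for n >= 1: floor(log2 n) zero bits, then n in binary.
    assert n >= 1
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b

# elias_gamma(1) -> "1", elias_gamma(2) -> "010", elias_gamma(5) -> "00101"
```

Short (recent) ranks get short codes, which matches the recency rank lists being kept short via Span.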
Because of the lack of correlation between groups, relative pointers are not coded; to make up for this, when a group
becomes full the ideal strategy would be to find the least recently used leaf (Storer). Since this requires a dedicated
data structure, I turned instead to the shrink method (PKWare, British Telecom), where all leaves are pruned, with
the small refinement of limiting their number to 32 to avoid wasting one bit in the pointer.
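A shrink-style clearing pass can be sketched as follows (illustrative Python over a plain child map, not Glue's actual trie layout; the 32-leaf cap is the refinement mentioned above):

```python
def prune_leaves(children, cap=32):
    # children: node -> set of child nodes. Prune up to `cap` leaves
    # (nodes with no children), detaching them from their parents.
    leaves = [n for n in list(children) if not children[n]]
    for leaf in leaves[:cap]:
        del children[leaf]
        for kids in children.values():
            kids.discard(leaf)
    return len(leaves[:cap])
```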
My goal in developing this program was to reach the performance of the deflate algorithm with a comparable dictionary
size. Regarding speed, I was resigned to the cost of the dynamic entropy coder needed for the PPM-style prediction;
in compression ratio Glue does about 10% worse. I am not satisfied with this result, but I think the program may at
least be didactic; as far as I know there is no other example of a grouped LZW.
Greetings, Marco Borsari.

2. ## Thanks:

Simorq (6th April 2018)

3. Code:
```
Usage:
  To append some files        : LHARC A arc f1..fn
  To recurse into a directory : LHARC R arc path
  To extract from an archive  : LHARC [E|X] arc
```
enwik8 result: 39,608,932 bytes
- Encoding fast (<10sec)
- Decoding very fast (~1sec)

Win32 Build

4. ## Thanks:

comp1 (8th May 2015)

5. To msg #2:
Please note that this version of lharc uses the Dos unit to handle files and directories;
to compile for Win32 one should use the SysUtils unit instead and remap the calls to its procedures.

6. What is your use case in developing just for DOS? In this forum you won't generate much interest or willing testers without a Windows/Unix build.
But at least this fast compilation still works in newer Windows versions with limited DOS integration.

7. Testing in 32 bit Vista, 2.0 GHz T3200, 3 GB:

Code:
```
C:\tmp>timer32 glue e \res\enwik9 enwik9.glue
Not available nodes for reuse

Exit code: 1

Global Time  =   176.656 =  100%
```
enwik8 compression is OK: 44,876,476, 38.9s, 31.5s

8. Originally Posted by Simon Berger
Win32 Build
Compressed an uncompressed 6 MB PNG very fast, and it extracted OK.
Also OK with 2 text files (less than 10 MB each), compressed separately.

But when I tried a 2.7 MB XLS I got a constant 13% CPU (1 thread) and the program locked up; I aborted after 1 minute. I got one file, about 875 KB, which I was unable to extract.
I created a new XLS with the output of "dir d:\ /s /a > 1.txt" copied and pasted in, and saved it. Entering "lharc a test test.xls", lharc compresses the 4 MB XLS to 1 MB but does not exit. After a few seconds I pressed Ctrl-C. The compressed file is missing its header: the first 32 bytes are 00h, the rest could be an lh5 stream.

Compressing zpaq64.v7.05.exe from folder p\ shrinks it from 657920 to 262256 bytes, but I am unable to extract:
Z:\>lharc.exe a zp p\zpaq64.exe
Z:\>lharc.exe x zp.lzh = Header or compression method unsupported.

Compressing without the folder:
Z:\>lharc.exe a zp zpaq64.exe
Z:\>lharc.exe x zp.lzh = zpaq64.exe CRC : 5a84 = OK!

Also, lharc does not support filenames longer than 8.3.

9. To msg #4:
That version of lha was designed for its original system, i.e. DOS. Glue can be compiled on any system, as it does not contain any archive-management code.

10. To msg #5:
"Not available nodes for reuse" is a condition that occurs when a group becomes full and among its nodes there is only one leaf, but that leaf is about to become a prefix of the new dictionary item. In such a condition it is not possible to recycle a node for the new item. Typically this happens when the input stream contains a repeated sequence that fills all 255 slots of a group. A naive solution would be an RLE pass as a first filter, but it would not help where the sequence is abab...; a better solution would be to limit the maximum length of the dictionary items.
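The naive RLE pre-filter mentioned above can be sketched like this (illustrative Python, not part of Glue):

```python
def rle_encode(data: bytes):
    # Collapse each run of equal bytes into a (byte, run_length) pair.
    out = []
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i]:
            j += 1
        out.append((data[i], j - i))
        i = j
    return out

# rle_encode(b"aaaabb") -> [(97, 4), (98, 2)]
# b"abab" yields one pair per byte, which is why RLE does not save there.
```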

11. To msg #6:
From errors such as the first 32 characters being blank, or "Header or compression method unsupported", I guess the problem could be an incompatibility from using the Dos unit when compiling for Windows. In particular, I think the major issue is in the procedures that set/write file attributes in the presence of extended attributes on Windows.

12. I fixed the problem shown in msg #5; Glue should now work even with files containing unbounded repeated sequences.
The solution adopted is not among those in msg #8: instead, the dictionary stops adding new entries
until the symbol found is no longer the only leaf.
Also fixed a bug when the repeated sequence appears at the beginning of a file.
Since I am away from home I cannot update my site with the DOS executable, so I put a Linux one here.

13. I extended Glue to be capable of Flexible Parsing (Horspool; Matias-Rajpoot-Sahinalp). It looks only at the
first character when deciding whether to add the new entry to the dictionary, instead of searching the entire
word for a character that could lead to an absent entry.
I found this approach does not compromise the dictionary's rate of learning, provided that, for the same
sum of the lengths of the two words, the parser prefers to preserve the first one.
To verify, try changing line 752 in the attached source from

IF U+V>=Edge THEN

into

IF U+V>Edge THEN

It is a semi-quadratic time algorithm, with n<256.
Due to the nature of FP I had to remove the exclusion list at order 1 (see msg #1); furthermore, the present
version has only one escape code (because single characters should be rare), and the clearing of a group
no longer limits the number of pruned leaves.
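The parsing preference described above can be sketched with a toy dictionary lookup in place of Glue's trie (hypothetical Python; the function names are mine). Since every prefix of an LZW entry is itself an entry, any shortened first word is a valid match:

```python
def longest_match(s, i, dictionary):
    # Length of the longest dictionary word starting at position i.
    return max((len(w) for w in dictionary if s.startswith(w, i)),
               default=0)

def flexible_first_len(s, i, dictionary):
    # One-step flexible parsing: start from the greedy first word, then
    # try shorter first words u and measure the following match v.
    # Switch only when u + v strictly beats the current best, so ties
    # preserve the longer first word.
    best_u = longest_match(s, i, dictionary)
    edge = best_u + longest_match(s, i + best_u, dictionary)
    for u in range(best_u - 1, 0, -1):
        v = longest_match(s, i + u, dictionary)
        if u + v > edge:
            edge, best_u = u + v, u
    return best_u
```

With dictionary {a, ab, abc, bcde, de, cd} and input "abcde", shortening "abc" to "a" ties on total length (1+4 vs 3+2), so the greedy first word is preserved; only a strictly better split wins.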
The comparison with the previous version shows that for text files the compression ratio is approximately equal,
but for binary it is substantially worse:

moby.txt original size 1215756
lena.bmp original size 786486
glue5 moby.txt compressed size 563663
glue6 moby.txt compressed size 568713
glue5 lena.bmp compressed size 690838
glue6 lena.bmp compressed size 754546

14. A new version of glue6 is ready to be released (there is no minor version number).
I replaced the signal for an unknown character (dictionary index 256) with a real escape code,
though it is frozen in the last available position. Further coding of the prediction is done in unary.
The comparison with glue5 becomes:

moby.txt original size 1215756
lena.bmp original size 786486
glue5 moby.txt compressed size 563451
glue6 moby.txt compressed size 556197
glue5 lena.bmp compressed size 690879
glue6 lena.bmp compressed size 723198

15. I enhanced Glue5 to handle a compact trie for those
sequences which end in a leaf. This theoretically allows the
dictionary size to grow within the same symbol bit space as an
equivalent LZW. To ensure efficient use of space, the compacted
sequences are stored as a sparse linear list. They are created and
destroyed only when a group is full and needs to be cleared;
otherwise they remain fixed in the normal LZW mode. Given that such
sequences do not proliferate children, I added a score system to
tune their permanence in the dictionary.
At any moment a node of the trie can have either an ordinary child
or one that must be extended with a fixed string, so the trie grows
in reversed lexicographical order so that the latter is found first.
Because of this feature, special attention must be paid to keep
the exclusion lists usable for the order-1 prediction.
All tests show that the extra overlap my kind of string compaction
induces in the LZW outweighs what it saves:

moby.txt -> 565256
lena.bmp -> 692402

Source and Linux executable

16. Is there any chance of a Win64 executable?

17. @msg#14:
I am sorry, no. I am working far from home and I have only a Linux machine here.
Anyway, I could compile for Win32, but not for another month.
PS - I moved my personal site to http://digilander.libero.it/oldpumpkin

18. ## Thanks:

Darek (28th March 2017)

19. Here is the Win32 binary of Glue8; it has two minor changes that make it incompatible with the previous version:
- the score value of an extended string experiences a decay every time the Clear routine is called;
- a pointless limitation on the extended string creation was removed.

20. ## Thanks:

spark (15th April 2017)

21. I corrected the behavior of Glue8 when the Clear routine removes a useless extended string:
now the leaf is removed as well, to avoid breaking the reversed lexicographical order in the trie
and thus interfering with the parsing.
Here is the Linux binary; the Win32 one will be available on my site in two weeks (see post #15).

22. Because Glue6's prediction section does not need to be coupled with the LZW trie
(there is no exclusion list), I switched from order 1 to order 2.
I would appreciate comments on my choice of hash function; I suspect
it could be quite weak.
Please note that the previous version of Glue6 posted here in msg #12 is bugged,
as Decode omits to set the escape code in the rank when Empty is nonzero.
Results are listed below:

moby.txt -> 549789
lena.bmp -> 720313

23. ## Thanks (2):

load (1st September 2017),Simorq (6th April 2018)

24. I slightly modified the previous hash function and made some tests against the common one:
Code:
```
FUNCTION Hash(Idx:Word):Word;
VAR
  C1,C2:Cardinal;
BEGIN
  { Fetch the two most recent context bytes from the Shaker }
  { ring buffer, stepping backwards and wrapping at index 0. }
  IF Idx>0 THEN
    Dec(Idx)
  ELSE
    Idx:=Group2;
  C1:=Shaker[Idx];
  IF Idx>0 THEN
    Dec(Idx)
  ELSE
    Idx:=Group2;
  C2:=Shaker[Idx];
  { Common multiplicative mix; keep the high bits of the product. }
  C2:=C2*123456791;
  C1:=(C2+C1)*123456791;
  Hash:=C1 SHR 20;
END;
```
Results are comparable, i.e. it is worse on binary and better on text (!), so I keep my own hash.

25. ## Thanks:

Simorq (6th April 2018)

26. I have now added to Glue6 the LRU list for recycling a dictionary item, and found that
Storer's policy of placing it before its ancestor is detrimental to the compression ratio.
New results are:

moby.txt -> 541970
lena.bmp -> 703250
enwik8 -> 41880418

The last file is close to (actually slightly better than) Muraviev's lzw 0.2.

27. ## Thanks:

Simorq (6th April 2018)
