Thread: Grouped (ROLZ) LZW compressor

  1. #1
    Member
    Join Date
    May 2015
    Location
    Italy
    Posts
    56
    Thanks
    0
    Thanked 12 Times in 9 Posts

    Grouped (ROLZ) LZW compressor

    Hi all, this is the first time I write on this list. I would like to present Glue, a compressor I have written:
    it combines the LZW method with a prediction on a sorted dictionary, what in LZSS terms is called ROLZ.
    The Pascal source code can be found at
    http://digilander.libero.it/borsa77
    under the link Lempel-Ziv arc. The executable is compiled with FPC for DOS, and the same archive also contains
    the old work of Okumura and Yoshizaki.
    As described by Ross Williams in the explanatory paper for his LZRW4, the dictionary is split into 256 groups
    selected by the initial character of the word, each group holding 255 items; this is meant to reduce the pointer
    size, and the actual group prediction is attempted at orders 1 and 0.
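    As a rough illustration, here is a minimal Free Pascal sketch of the grouping idea; this is not Glue's actual code, and TItem, TGroup and AddWord are made-up names:
    Code:
    { 256 groups selected by the first character of a word, each holding at
      most 255 items, so an item inside a group is addressed by one byte. }
    PROGRAM GroupSketch;

    CONST
      GroupSize=255;

    TYPE
      TItem=RECORD
        Text:AnsiString;              { the dictionary word }
      END;
      TGroup=RECORD
        Count:Byte;                   { items used in this group }
        Item:ARRAY[1..GroupSize] OF TItem;
      END;

    VAR
      Dict:ARRAY[0..255] OF TGroup;   { one group per initial character }

    { Insert a word into the group selected by its first character and
      return the one-byte relative pointer, or 0 if the group is full. }
    FUNCTION AddWord(CONST W:AnsiString):Byte;
    VAR
      G:Byte;
    BEGIN
      G:=Ord(W[1]);
      IF Dict[G].Count<GroupSize THEN
      BEGIN
        Inc(Dict[G].Count);
        Dict[G].Item[Dict[G].Count].Text:=W;
        AddWord:=Dict[G].Count
      END
      ELSE
        AddWord:=0                    { group full: it must be cleared }
    END;

    BEGIN
      WriteLn('ab  -> group ',Ord('a'),' slot ',AddWord('ab'));
      WriteLn('abc -> group ',Ord('a'),' slot ',AddWord('abc'));
      WriteLn('de  -> group ',Ord('d'),' slot ',AddWord('de'))
    END.
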
    At order 1 a recency rank list is maintained for every group, and the coding is performed with Elias gamma. The
    list is quite short and its length is set by the constant Span; characters are never inserted at this
    order. Two escape codes can occur: either the symbol is a string or it is a character. In either case
    order 0 takes care of coding it with dynamic Huffman, omitting the pointer if it is a character.
    At order 1 the probabilities of the escape codes are adjusted with a move-to-front method, while for
    symbols a one-step shift is used. A nice feature of the LZW dictionary is that, since it is managed with a trie,
    every item in it holds a successor list which can be used as an exclusion list for the prediction (Horspool).
    As shown in the following example

    Dict: ab,abc,de
    Input: ab|de
    Order 1 rank for b: (0)c,(1)d

    the code for d can be 0 instead of 1; note that this approach works only with greedy parsing.
    The frequencies of Vitter's dynamic Huffman algorithm are never scaled down; this global modelling, juxtaposed with
    the local one imposed by the Span constant, should help handle a wide spectrum of possible inputs.
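    To make the order-1 coding more concrete, here is a minimal sketch of the rank lookup with exclusion and the Elias gamma output, assuming greedy parsing; ExclRank, PutGamma and PutBit are invented names, not Glue's routines:
    Code:
    PROGRAM RankSketch;

    VAR
      Bits:AnsiString;                 { collected output bits, for demonstration }

    PROCEDURE PutBit(B:Byte);
    BEGIN
      Bits:=Bits+Chr(Ord('0')+B)
    END;

    { Elias gamma code for N>=1: (length-1) zero bits, then N in binary. }
    PROCEDURE PutGamma(N:Cardinal);
    VAR
      L,I:Integer;
    BEGIN
      L:=0;
      WHILE (Cardinal(1) SHL (L+1))<=N DO
        Inc(L);
      FOR I:=1 TO L DO
        PutBit(0);
      FOR I:=L DOWNTO 0 DO
        PutBit((N SHR I) AND 1)
    END;

    { Rank of C in the recency list (a string of group characters), counting
      only entries that do not appear in Excluded. }
    FUNCTION ExclRank(CONST RankList,Excluded:AnsiString;C:Char):Integer;
    VAR
      I,R:Integer;
    BEGIN
      R:=0;
      ExclRank:=-1;                    { not found: an escape would be coded }
      FOR I:=1 TO Length(RankList) DO
        IF RankList[I]=C THEN
        BEGIN
          ExclRank:=R;
          Exit
        END
        ELSE IF Pos(RankList[I],Excluded)=0 THEN
          Inc(R)
    END;

    BEGIN
      { Dict: ab,abc,de  Input: ab|de  context b: with c excluded, d gets rank 0 }
      Bits:='';
      PutGamma(ExclRank('cd','c','d')+1);   { gamma codes N>=1, hence the +1 }
      WriteLn('rank bits for d: ',Bits)
    END.
    With the dictionary and input of the example above, excluding c lets d take rank 0, which the gamma code writes as a single bit.
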
    Because of the lack of correlation between groups, the relative pointers are not coded; to make up for this, when a group
    becomes full the ideal strategy would be to find the least recently used leaf (Storer). Since this requires a dedicated
    data structure, I turned instead to the shrink method (PKWare, British Telecom), where all leaves are pruned, with
    the small refinement of limiting their number to 32 to avoid wasting one bit in the pointer.
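    A minimal sketch of such a bounded pruning pass, using hypothetical Parent, ChildCount and Used arrays instead of Glue's actual trie, could look like this:
    Code:
    PROGRAM ShrinkSketch;

    CONST
      GroupSize=255;
      MaxPruned=32;

    VAR
      Parent:ARRAY[1..GroupSize] OF Byte;     { 0 means the group root }
      ChildCount:ARRAY[1..GroupSize] OF Byte; { 0 means the item is a leaf }
      Used:ARRAY[1..GroupSize] OF Boolean;
      I:Integer;

    { Free at most MaxPruned leaves of the group; returns how many were freed. }
    FUNCTION Shrink:Integer;
    VAR
      J,Pruned:Integer;
    BEGIN
      Pruned:=0;
      FOR J:=1 TO GroupSize DO
        IF Used[J] AND (ChildCount[J]=0) AND (Pruned<MaxPruned) THEN
        BEGIN
          Used[J]:=False;                     { this slot can now be reused }
          IF Parent[J]>0 THEN
            Dec(ChildCount[Parent[J]]);       { the parent may become a leaf later }
          Inc(Pruned)
        END;
      Shrink:=Pruned
    END;

    BEGIN
      { toy content: items 1..5 in use, item 1 being the parent of 2 and 3 }
      FOR I:=1 TO 5 DO
        Used[I]:=True;
      Parent[2]:=1;
      Parent[3]:=1;
      ChildCount[1]:=2;
      WriteLn('leaves pruned: ',Shrink)       { prints 4: items 2, 3, 4 and 5 }
    END.
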
    My goal in developing this program was to reach the performance of the deflate algorithm with a comparable dictionary
    size. With respect to speed I was resigned to the cost of a dynamic entropy coder, as with PPM; in compression ratio Glue
    does worse by about 10%. I am not satisfied with this result, but I think the program may at least be didactic:
    as far as I know there is no other example of a grouped LZW.
    Greetings, Marco Borsari.
    Last edited by Marco_B; 1st May 2018 at 15:59.

  2. Thanks:

    Simorq (6th April 2018)

  3. #2
    Member
    Join Date
    Oct 2007
    Location
    Germany, Hamburg
    Posts
    409
    Thanks
    0
    Thanked 5 Times in 5 Posts
    Code:
    Usage :
    To append some files : LHARC A arc f1..fn
    To recurse into a directory : LHARC R arc path
    To extract from an archive  : LHARC [E|X] arc
    Enwik8 Result 39.608.932 Bytes
    - Encoding fast (<10sec)
    - Decoding very fast (~1sec)

    Win32 Build
    Attached Files

  4. Thanks:

    comp1 (8th May 2015)

  5. #3
    Member
    Join Date
    May 2015
    Location
    Italy
    Posts
    56
    Thanks
    0
    Thanked 12 Times in 9 Posts
    To msg #2:
    Please note that this version of lharc uses the Dos unit to handle files and directories;
    to compile for Win32 one should use the SysUtils unit instead and remap the procedure calls.
    Last edited by Marco_B; 8th June 2015 at 10:10.

  6. #4
    Member
    Join Date
    Oct 2007
    Location
    Germany, Hamburg
    Posts
    409
    Thanks
    0
    Thanked 5 Times in 5 Posts
    What is your use case in developing just for DOS? In this forum you won't generate much interest or many test-willing users without a Windows/Unix build.
    But at least this fast compilation still works in newer Windows versions with limited DOS integration.

  7. #5
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 795 Times in 488 Posts
    Testing in 32 bit Vista, 2.0 GHz T3200, 3 GB:

    Code:
    C:\tmp>timer32 glue e \res\enwik9 enwik9.glue
    Not available nodes for reuse
    
    Exit code: 1
    
    
    Global Time  =   176.656 =  100%
    enwik8 compression is OK: 44,876,476, 38.9s, 31.5s

  8. #6
    Member
    Join Date
    Jun 2013
    Location
    Sweden
    Posts
    150
    Thanks
    9
    Thanked 25 Times in 23 Posts
    Quote Originally Posted by Simon Berger View Post
    Win32 Build
    Compressed an uncompressed 6 MB PNG very fast, and it extracted okay.
    Also okay with 2 text files (less than 10 MB), compressed separately.

    But when I tried a 2.7 MB XLS I got a constant 13% CPU (1 thread) and the program locked up. Aborted after 1 minute. Got 1 file, about 875 KB, unable to extract.
    Created a new XLS with the content of "dir d:\ /s /a > 1.txt" copy/pasted into it, and saved. Entering "lharc a test test.xls", lharc does compress the 4 MB xls to 1 MB, but it does not exit. After a few seconds I pressed ctrl-c. The compressed file is missing a header; the first 32 chars of the compressed file are 00h, the rest could be an lh5 stream.

    Compressing from folder p\ zpaq64.v7.05.exe from 657920 to 262256 but unable to extract:
    Z:\>lharc.exe a zp p\zpaq64.exe
    Z:\>lharc.exe x zp.lzh = Header or compression method unsupported.

    Compressing without folder
    Z:\>lharc.exe a zp zpaq64.exe
    Z:\>lharc.exe x zp.lzh = zpaq64.exe CRC : 5a84 = okey!

    Also, lharc does not support filenames longer than 8.3.

  9. #7
    Member
    Join Date
    May 2015
    Location
    Italy
    Posts
    56
    Thanks
    0
    Thanked 12 Times in 9 Posts
    To msg #4:
    That version of lha was designed for its original system, i.e. DOS. Glue can be compiled on any system, as it doesn't contain any archive-management code.
    Last edited by Marco_B; 8th June 2015 at 10:11.

  10. #8
    Member
    Join Date
    May 2015
    Location
    Italy
    Posts
    56
    Thanks
    0
    Thanked 12 Times in 9 Posts
    To msg #5:
    Not available nodes for reuse is a condition which happens when a group becomes full and among its nodes there is only one leaf, but that leaf is the one about to become a prefix of the new dictionary item. In such a condition it is not possible to recycle a node for the new item. Typically this is the case when the input stream presents a repeated sequence that fills all 255 slots of a group. A naive solution would be an RLE pass as a first filter, but it would not help where the sequence is abab... A better solution would be to limit the maximum length of the dictionary items.
    Last edited by Marco_B; 12th October 2015 at 15:16.

  11. #9
    Member
    Join Date
    May 2015
    Location
    Italy
    Posts
    56
    Thanks
    0
    Thanked 12 Times in 9 Posts
    To msg #6:
    From errors such as the first 32 characters being blank or Header or compression method unsupported, I guess the problem could be an incompatibility of the Dos unit when compiling for Windows. In particular I think the main issue is the procedures that set/write file attributes in the presence of extended attributes on Windows.
    Last edited by Marco_B; 17th June 2015 at 10:53.

  12. #10
    Member
    Join Date
    May 2015
    Location
    Italy
    Posts
    56
    Thanks
    0
    Thanked 12 Times in 9 Posts
    I fixed the problem shown in msg #5; now Glue should work even with files containing unbounded repeated sequences.
    The solution adopted is not among those in msg #8: instead, the dictionary stops adding new entries
    until the symbol found is no longer the only leaf.
    Also fixed a bug when the repeated sequence appears at the beginning of a file.
    Since I am away from home I cannot update my site with the DOS executable, so I put a Linux one here.
    Attached Files
    Last edited by Marco_B; 8th April 2020 at 16:13.

  13. #11
    Member
    Join Date
    May 2015
    Location
    Italy
    Posts
    56
    Thanks
    0
    Thanked 12 Times in 9 Posts
    I extended Glue to support Flexible Parsing (Horspool, Matias-Rajpoot-Sahinalp). It looks only at the
    first character when deciding whether to add the new entry to the dictionary, instead of searching the entire word for
    the character that could lead to an absent entry.
    I found this approach does not compromise the learning rate of the dictionary, provided that, for the same
    sum of the lengths of the two words, the parser prefers to keep the first one.
    To verify this, change line 752 of the attached source from

    IF U+V>=Edge THEN

    into

    IF U+V>Edge THEN

    It is a semi-quadratic time algorithm, with n<256.
    Due to the nature of FP I had to remove the exclusion list at order 1 (see msg #1); furthermore the present version
    has only one escape code (because single characters should be rare) and the clearing of a group no longer has
    a limit on the number of pruned leaves.
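    For reference, here is a generic sketch of one-step-lookahead flexible parsing on a toy dictionary; it is not the code around line 752, and LongestMatch, ChooseFirstLen and the tie-breaking comparison are only illustrative:
    Code:
    PROGRAM FlexSketch;

    CONST
      NWords=3;
      { toy dictionary, taken from the example in msg #1 }
      Words:ARRAY[1..NWords] OF String=('ab','abc','de');

    VAR
      Input:String;

    { Length of the longest dictionary word matching Input at position Pos
      (at least 1, since a single character can always be coded). }
    FUNCTION LongestMatch(Pos:Integer):Integer;
    VAR
      I,Best:Integer;
    BEGIN
      Best:=1;
      FOR I:=1 TO NWords DO
        IF (Length(Words[I])>Best) AND
           (Copy(Input,Pos,Length(Words[I]))=Words[I]) THEN
          Best:=Length(Words[I]);
      LongestMatch:=Best
    END;

    { Among the candidate lengths for the current phrase keep the one that
      maximizes current+next length; the ">" below lets the longer, first
      tried, phrase win ties. For brevity every length up to the greedy one
      is tried; a real LZW parser would only try lengths that are themselves
      dictionary phrases. }
    FUNCTION ChooseFirstLen(Pos:Integer):Integer;
    VAR
      Greedy,L,U,V,Best:Integer;
    BEGIN
      Greedy:=LongestMatch(Pos);
      Best:=-1;
      ChooseFirstLen:=Greedy;
      FOR L:=Greedy DOWNTO 1 DO
      BEGIN
        U:=L;
        IF Pos+U<=Length(Input) THEN
          V:=LongestMatch(Pos+U)
        ELSE
          V:=0;
        IF U+V>Best THEN
        BEGIN
          Best:=U+V;
          ChooseFirstLen:=U
        END
      END
    END;

    BEGIN
      Input:='abde';
      WriteLn('first phrase length at position 1: ',ChooseFirstLen(1))
    END.
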
    Comparison with the previous version shows that for text files the compression ratio is approximately equal,
    but for binary it is substantially worse:

    moby.txt original size 1215756
    lena.bmp original size 786486
    glue5 moby.txt compressed size 563663
    glue6 moby.txt compressed size 568713
    glue5 lena.bmp compressed size 690838
    glue6 lena.bmp compressed size 754546
    Attached Files
    Last edited by Marco_B; 26th September 2017 at 12:09.

  14. #12
    Member
    Join Date
    May 2015
    Location
    Italy
    Posts
    56
    Thanks
    0
    Thanked 12 Times in 9 Posts
    A new version of glue6 is ready to be released (there is no minor number).
    I replaced the signal for an unknown character (the dictionary index 256) with a real escape code,
    though it is frozen in the last available position. Furthermore, the prediction is now coded with unary codes
    instead of Elias gamma.
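    A minimal sketch of the unary code that now replaces Elias gamma for the rank (PutBit is a made-up output routine, and the exact bit polarity used in Glue may differ):
    Code:
    PROGRAM UnarySketch;

    VAR
      Bits:AnsiString;

    PROCEDURE PutBit(B:Byte);
    BEGIN
      Bits:=Bits+Chr(Ord('0')+B)
    END;

    { Rank R is written as R one-bits followed by a terminating zero. }
    PROCEDURE PutUnary(R:Integer);
    VAR
      I:Integer;
    BEGIN
      FOR I:=1 TO R DO
        PutBit(1);
      PutBit(0)
    END;

    BEGIN
      Bits:='';
      PutUnary(0);                 { writes 0    }
      PutUnary(3);                 { writes 1110 }
      WriteLn(Bits)                { prints 01110 }
    END.
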
    The comparison with glue5 becomes:

    moby.txt original size 1215756
    lena.bmp original size 786486
    glue5 moby.txt compressed size 563451
    glue6 moby.txt compressed size 556197
    glue5 lena.bmp compressed size 690879
    glue6 lena.bmp compressed size 723198
    Attached Files

  15. #13
    Member
    Join Date
    May 2015
    Location
    Italy
    Posts
    56
    Thanks
    0
    Thanked 12 Times in 9 Posts
    I enhanced Glue5 to handle a compact trie for those
    sequences which end with a leaf. In theory this allows the
    dictionary size to be increased within the same symbol bit space
    as an equivalent LZW.
    To use the space efficiently, the compacted sequences are
    stored as a sparse linear list. Their creation and destruction take place
    only when a group is full and needs to be cleared; otherwise they remain
    fixed, in the normal LZW mode. Given that such a sequence does not
    proliferate children, I added a score system to tune its
    permanence in the dictionary.
    At any moment a node of the trie can have either an ordinary child
    or one that must be extended with a fixed string, so the trie grows
    in reverse lexicographical order so that the latter is found first.
    Because of the above feature, special attention must be taken in order
    to keep using the exclusion lists for the order-1 prediction.
    All the tests show that the increase in overlap induced in the LZW by my
    own kind of string compaction outweighs what it saves:

    moby.txt -> 565256
    lena.bmp -> 692402

    Source and Linux executable
    Attached Files
    Last edited by Marco_B; 16th April 2018 at 14:24.

  16. #14
    Member
    Join Date
    Dec 2008
    Location
    Poland, Warsaw
    Posts
    1,152
    Thanks
    704
    Thanked 455 Times in 352 Posts
    Is there any chance of a Win64 executable?

  17. #15
    Member
    Join Date
    May 2015
    Location
    Italy
    Posts
    56
    Thanks
    0
    Thanked 12 Times in 9 Posts
    @msg#14:
    I am sorry, no. I am working far from home and here I have only a Linux machine.
    Anyway, I could compile for Win32, but only in a month.
    PS - I moved my personal site to http://digilander.libero.it/oldpumpkin

  18. Thanks:

    Darek (28th March 2017)

  19. #16
    Member
    Join Date
    May 2015
    Location
    Italy
    Posts
    56
    Thanks
    0
    Thanked 12 Times in 9 Posts
    Here is the Win32 binary of Glue8; it has two minor changes that make it incompatible with the previous version:
    - the score value of an extended string decays every time the Clear routine is called;
    - a pointless limitation on extended string creation was removed.
    Attached Files

  20. Thanks:

    spark (15th April 2017)

  21. #17
    Member
    Join Date
    May 2015
    Location
    Italy
    Posts
    56
    Thanks
    0
    Thanked 12 Times in 9 Posts
    I corrected the behavior of Glue8 when the Clear routine removes a useless extended string:
    now the leaf is removed as well, to avoid breaking the reverse lexicographical order in the trie
    and thus interfering with the parsing.
    Here is the Linux binary; the Win32 one will be available on my site in two weeks (see post #15).
    Attached Files
    Last edited by Marco_B; 8th April 2020 at 16:12.

  22. #18
    Member
    Join Date
    May 2015
    Location
    Italy
    Posts
    56
    Thanks
    0
    Thanked 12 Times in 9 Posts
    Because, in the absence of the exclusion list, Glue6's prediction section no longer needs to be
    coupled with the LZW trie, I switched from order 1 to order 2.
    I would appreciate comments about my choice of the hash function; I suspect
    it could be quite weak.
    Please note that the previous version of Glue6 posted here in msg #12 is bugged,
    as Decode omits to set the escape code in the rank when Empty is nonzero.
    Results are listed below:

    moby.txt -> 549789
    lena.bmp -> 720313
    Attached Files
    Last edited by Marco_B; 22nd September 2017 at 10:45.

  23. Thanks (2):

    load (1st September 2017),Simorq (6th April 2018)

  24. #19
    Member
    Join Date
    May 2015
    Location
    Italy
    Posts
    56
    Thanks
    0
    Thanked 12 Times in 9 Posts
    I slightly modified the previous hash function and ran some tests against the common one:
    Code:
    { Mix the two Shaker entries that precede position Idx, wrapping around
      the buffer (Group2 is its highest index), into a 12-bit value used as
      the order-2 context. }
    FUNCTION Hash(Idx:Word):Word;
    VAR
      C1,C2:Cardinal;
    BEGIN
      { step back one position, wrapping around }
      IF Idx>0 THEN
        Dec(Idx)
      ELSE
        Idx:=Group2;
      C1:=Shaker[Idx];
      { step back one more position }
      IF Idx>0 THEN
        Dec(Idx)
      ELSE
        Idx:=Group2;
      C2:=Shaker[Idx];
      { multiplicative mixing, keeping the top 12 bits of the product }
      C2:=C2*123456791;
      C1:=(C2+C1)*123456791;
      Hash:=C1 SHR 20;
    END;
    Results are comparable, i.e. it is worse on binary and better on text (!), so I retain my own hash.
    Attached Files
    Last edited by Marco_B; 5th April 2018 at 14:13.

  25. Thanks:

    Simorq (6th April 2018)

  26. #20
    Member
    Join Date
    May 2015
    Location
    Italy
    Posts
    56
    Thanks
    0
    Thanked 12 Times in 9 Posts
    I have now added to Glue6 an LRU list to recycle dictionary items, and found that
    Storer's style of placing an item before its ancestor is detrimental to the compression ratio.
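    For reference, here is a generic sketch of an array-based LRU list over dictionary item indices; the names are hypothetical, it is not Glue's code, and it does not model the ancestor-ordering issue mentioned above:
    Code:
    PROGRAM LruSketch;

    CONST
      N=5;                              { toy number of dictionary items }

    VAR
      Prev,Next:ARRAY[0..N] OF Integer; { node 0 is a sentinel head/tail }
      I:Integer;

    { Unlink item I from the list. }
    PROCEDURE Detach(I:Integer);
    BEGIN
      Next[Prev[I]]:=Next[I];
      Prev[Next[I]]:=Prev[I]
    END;

    { Mark item I as most recently used: reinsert it right after the sentinel. }
    PROCEDURE Touch(I:Integer);
    BEGIN
      Detach(I);
      Next[I]:=Next[0];
      Prev[I]:=0;
      Prev[Next[0]]:=I;
      Next[0]:=I
    END;

    { Least recently used item: the one just before the sentinel. }
    FUNCTION Victim:Integer;
    BEGIN
      Victim:=Prev[0]
    END;

    BEGIN
      { initial recency order 1..N, item 1 the most recent }
      Next[0]:=1;
      Prev[0]:=N;
      FOR I:=1 TO N DO
      BEGIN
        Next[I]:=I+1;
        Prev[I]:=I-1
      END;
      Next[N]:=0;
      Touch(3);                       { item 3 is used again }
      WriteLn('recycle item ',Victim) { prints 5, the least recently used }
    END.
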
    New results are:

    moby.txt -> 541970
    lena.bmp -> 703250
    enwik8 -> 41880418

    The last result is close to (actually slightly better than) Muraviev's lzw 0.2.
    Attached Files
    Last edited by Marco_B; 5th May 2020 at 09:38.

  27. Thanks:

    Simorq (6th April 2018)
