
Thread: NLZM 1.03

  1. #1
    Member

    NLZM 1.03

    A new graph parser, plus other general improvements and tidying:
    enwik8.txt compresses to 25,056,894 bytes and decompresses in 0.81 seconds.
    enwik9.txt compresses to 206,548,020 bytes and decompresses in 7.16 seconds.

    +rolz:
    enwik8.txt compresses to 23,009,918 bytes and decompresses in 1.30 seconds.

    The new graph parser with early exits (back links usually point to just a few undecided tokens) makes it practical to do optimal per-word or per-character ROLZ (quick test: search match-table offsets in the ROLZ table, then replace the LZ distance with a ROLZ offset), but I couldn't get a better improvement than 4-5% on general text data, which isn't worth the extra memory usage for me (16k-64k entries * 256 contexts). There are probably better strategies, such as an LRU that slides out rarely hit entries to build a dynamic dictionary. The code is commented out if anyone would like to attempt it. The rest of the model and adaptation parameters are open to optimization for different types of data.
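
    As a rough sketch of that quick test (the table layout, names and sizes here are illustrative assumptions, not the shipped code):
    Code:
    // Given an LZ match whose source is at match_pos, see whether match_pos is
    // reachable through the o1 ROLZ table for the previous byte; if so, the LZ
    // distance can be replaced by the (much smaller) ROLZ depth.
    int find_rolz_offset(uint8 prev_byte, uint32 match_pos) {
        for (uint32 depth = 0; depth < ROLZ_SLOTS; depth++) {
            uint32 slot = (rolz_head[prev_byte] - depth) & (ROLZ_SLOTS - 1);
            if (rolz_pos[prev_byte][slot] == match_pos)
                return (int)depth; // encode this depth instead of the LZ distance
        }
        return -1; // not in the table, keep the LZ match as-is
    }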

    Attached executable (clang 10)
    Source: https://github.com/nauful/NLZM
    Last edited by cade; 10th November 2020 at 18:47. Reason: Added rolz experiment

  2. Thanks:

    Nania Francesco (8th November 2020)

  3. #2
    Administrator Shelwien
    Code:
    Z:\tmp5>lzma.exe e -d27 -fb273 -lc8 -lp0 -pb0 enwik8 enwik8.lzma
    
    LZMA 19.00 (x86) : Igor Pavlov : Public domain : 2019-02-21
    
    Input size:  100000000 (95 MiB)
    Output size: 24582709 (23 MiB)

  4. #3
    Member
    With the same default parameters (nice match length of 64) and without order-1 literal modelling, this is the closest setting for LZMA:
    Code:
    .\lzma.exe e -d27 -fb64 -lc0 -lp0 -pb0 enwik8.txt out8a > out8a.txt
    Output size: 24902880 (23 MiB)
    I wonder how much of that is due to literal exclusion after a match and to state-index modelling. Will do a few more experiments.

  5. #4
    Administrator Shelwien
    There's no true literal exclusion in LZMA - it sometimes encodes shorter matches than possible
    (i.e. a greater length could be used with the same distance).

    As for "state" - it's a quantization of the most recent token types; mostly it remembers
    the match type if the previous token was a match (match/rep-match/rep-literal)
    and the number of literals since the last match (up to 3 or so).
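
    A rough sketch of that state update, written from memory of the reference decoder - treat it as an approximation rather than a quote of LzmaDec.c, and the token names are placeholders:
    Code:
    // LZMA keeps a 12-value "state": values < 7 mean the last token was a
    // literal (further split by what preceded it), while 7..11 remember which
    // kind of match was last and whether a literal followed it.
    enum Token { TOKEN_LITERAL, TOKEN_MATCH, TOKEN_REP, TOKEN_SHORTREP }; // placeholder names
    
    unsigned update_state(unsigned state, Token token) {
        switch (token) {
            case TOKEN_LITERAL:  return state < 4 ? 0 : (state < 10 ? state - 3 : state - 6);
            case TOKEN_MATCH:    return state < 7 ? 7 : 10;
            case TOKEN_REP:      return state < 7 ? 8 : 11;
            case TOKEN_SHORTREP: return state < 7 ? 9 : 11;
        }
        return state;
    }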

  6. #5
    Member
    Added a small experiment with o1 ROLZ, 64k positions per previous byte. It replaces an LZ match with an equivalent ROLZ match, if one is found, mapping (dictionary distance, match length) -> (ROLZ depth, 2*abs(match length - previous word length) + (match length < previous word length)) when the cost of the latter is smaller. Still thinking about a faster and better way to search ROLZ, because the cost includes both the depth and the length difference.
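
    A minimal sketch of that length remapping (the function name is just illustrative):
    Code:
    // Fold the signed difference against the previously stored word length into
    // the non-negative code above: 2*|len - prev_len|, plus 1 when the new
    // match is shorter. The ROLZ pair (depth, code) only replaces the LZ pair
    // (distance, len) when its estimated cost is smaller.
    int rolz_len_code(int len, int prev_len) {
        int d = len < prev_len ? prev_len - len : len - prev_len;
        return 2 * d + (len < prev_len ? 1 : 0);
    }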

    enwik8.txt compresses to 23,009,918 bytes and decompresses in 1.30 seconds. The executable is in the first post.

    Edit: With more redundant, classic-ROLZ-friendly data, e.g. FP.log: 20,617,071 -> 588,072 bytes, decompressing in 0.06 seconds.

  7. Thanks (3):

    algorithm (10th November 2020),Bulat Ziganshin (11th November 2020),Nania Francesco (11th November 2020)

  8. #6
    Tester Nania Francesco
    Really interesting as a compressor, both the LZ77 and ROLZ sides. I am working on an archiver (also solid) that uses LZ77, one that uses ROLZ, and finally another version that uses a hybrid of the two systems, and I am very interested in a collaboration. However, at the moment, I don't think open source is a viable solution for my programs.

  9. #7
    Member
    I'm always happy to talk about compression. There's no obligation to open-source anything I contribute to, which is one of the reasons I mostly release my own programs into the public domain.

  10. #8
    Tester Nania Francesco
    Could you explain better how the trial version works with ROLZ? I didn't find the source of this experimental compressor.

  11. #9
    Member
    In the version uploaded:
    Code:
    const static int TABLE_BITS = 16;
    
    // o1 ROLZ table: for each previous-byte context, a ring of the last 64k
    // insert positions and the lengths of the words inserted at them.
    uint32 prev_pos[0x100][1 << TABLE_BITS];
    uint16 prev_lens[0x100][1 << TABLE_BITS];
    uint32 head[0x100]; // current write index per context
    ROLZ offset and length are both written as a CDF-coded log base plus raw bits for the remainder: (base = log2(value), value - 2^base). Lengths are encoded the same way as normal LZ lengths. In addition, lengths are context-coded by log2(prev_len), so we have CDF log_rolz_offset and CDF4 log_rolz_len_hi[8] + CDF4 log_rolz_len_lo[8][16].
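
    A minimal sketch of that log-exp split (encode_cdf, encode_raw_bits and log_cdf are placeholder names for the coder calls, not the actual API):
    Code:
    // Split a value v >= 1 into (base, remainder): the base goes through the
    // CDF, the remainder v - 2^base is written as 'base' raw bits.
    void encode_log_exp(uint32 v) {
        int base = 0;
        while ((v >> base) > 1) // base = floor(log2(v))
            base++;
        encode_cdf(log_cdf, base);                   // placeholder CDF coder call
        if (base > 0)
            encode_raw_bits(v - (1u << base), base); // placeholder raw-bits call
    }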

    Insert into ROLZ at every word (e.g. a literal is (pos, 0), a match is (dest pos, match len)). I run the same match finder as LZ but with a different acceptance criterion: instead of taking the longest match, it checks, for len = 2 up to the matched length, whether the cost of (ROLZ depth, length diff) is cheaper than the current cost (normal LZ for the same match length is always cheaper at smaller offsets). Including the previous byte in the match test (i.e. check pos - 1, then the match length is -1 afterwards) means there's only one match finder for the whole system, not one per context.
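
    A minimal sketch of that acceptance check, assuming the match finder has already produced a candidate at match_pos with up to 'matched' matching bytes (max_depth, cost_rolz and the best_* variables are illustrative):
    Code:
    // Walk the o1 ROLZ ring for the previous byte; if the candidate position is
    // reachable at some depth, test every usable length and keep the cheapest
    // (depth, length code) encoding seen so far.
    for (uint32 depth = 0; depth < max_depth; depth++) {
        uint32 slot = (head[prev_byte] - depth) & ((1 << TABLE_BITS) - 1);
        if (prev_pos[prev_byte][slot] != match_pos)
            continue;
        int prev_len = prev_lens[prev_byte][slot];
        for (int len = 2; len <= matched; len++) {
            int d = len < prev_len ? prev_len - len : len - prev_len;
            int len_code = 2 * d + (len < prev_len ? 1 : 0);
            int cost = cost_rolz(depth, len_code); // placeholder cost model
            if (cost < best_cost) {
                best_cost = cost;
                best_depth = depth;
                best_len = len;
            }
        }
    }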

    Improvements after:
    - Simple mistake: I forgot to context-code ROLZ offsets by the previous byte too, so we now have CDF log_rolz_offset[0x100] and CDF log_rolz_len[8].
    - Switched length encoding from flat to log-exp (the same as offsets, as described above).
    This reaches 22,728,933 on enwik8; however, the cost check now runs for every length of every match from the binary-tree match finder (not only when the match length increases).

    Thinking about:
    - Greatly reducing memory. Most of the cells are empty, so a scheme like this:
    Code:
    // 64k
    const int CELL_SIZE = 256;
    const int LUT_SIZE = 256;
    
    struct Cell {
        uint32 prev_pos[CELL_SIZE];
        uint16 prev_len[CELL_SIZE];
    };
    
    struct CellLUT {
        uint32 cells[LUT_SIZE];
    };
    
    struct RolzTable {
        CellLUT lut[0x100];
        uint32 start[0x100], head[0x100];
        Cell *cells;
    };
    
    // Some scheme to assign cells and increment starts to drop old cells when available cells run low.
    Nearly all matches (for some definition of "nearly all") are within 4k entries, so a bounded-memory scheme that doesn't have to limit every previous-byte context to 4k will keep ROLZ memory usage more reasonable (<10 MB).

    - A more accurate matcher instead of the classic increasing-length-only one. A hash-chain (instead of binary-tree) match finder reaches 22,274,413, but runs overnight.

  12. Thanks (3):

    Bulat Ziganshin (12th November 2020),Nania Francesco (12th November 2020),Sergey3695 (14th November 2020)

