
Thread: NLZM - Semi-optimal not-really-LZMA compressor

  #1: cade (Member, US)

    This is NLZM (not-really-LZMA), a general compressor with semi-optimal parsing, LZ (window up to 1 GB), an o0 + o1 CM, and simple probability counters for decisions. I started writing an optimal LZ + ari, then switched decisions to separate probability counters (i.e. a PPM-style escape vs. encode decision) instead of a single alphabet for literals + log2 lengths. For distances of 4 or more bits, the lower 4 bits are also entropy coded. The current state is determined by the two most recent decisions (literal vs. match), and there is a simple rep0.

    I'll clean up the source (a jumble of files reused by different experiments), check/fix any bugs, then make it open source. The attached exe is 64-bit, compiled with clang++ 9.0.0 on Windows 8.1.

    Source here: https://github.com/nauful/NLZM

    Results (c to compress, cf to compress fast, d to decompress, t to test in-memory decompression and CRC32):
    nlzm -window:27 c enwik8.txt out8
    nlzm -window:30 c enwik9 out9
    enwik8 compresses to 25,226,391 bytes and decompresses in 0.95 seconds.
    enwik9 compresses to 203,649,613 bytes and decompresses in 8.64 seconds.

    Semi-optimal parsing works as follows: all matches are found (exhaustively) for the block, matches are also found (without updating the search structures) up to block size + nice length, and the parse is then traced by finding the minimum-cost path from block size + nice length back to the block start. Not terminating matches at the block boundary allows better parsing decisions when matches could cross block boundaries. The block start is always aligned to the block size so that the match finder is updated at every position. When the decision table for a block is calculated, the current costs of encoding (literal cost, match cost, etc.) are estimated as log2[prob] bits per symbol (~1-3% error compared to the final compressed file size with 8k blocks) and used for path finding.
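    Roughly, the backward pass looks like the sketch below (a simplified illustration with hypothetical helper names, not the actual code; the 8-length stepping and the nice-length overshoot are left out):
    Code:
    #include <algorithm>
    #include <cstdint>
    #include <vector>
    
    struct Candidate { uint32_t dist, len; };   // one match reported by the match finder
    struct Step { uint32_t len, dist; };        // len == 1, dist == 0 means "code a literal"
    
    // Assumed helpers (cost model and match finder), declared but not defined here:
    double LiteralCost(uint32_t pos);
    double MatchCost(uint32_t dist, uint32_t len);
    std::vector<Candidate> MatchesAt(uint32_t pos);
    
    // cost_to_end[i] = cheapest estimated cost of coding everything from
    // block_start + i to the block end; best[i] = the step that achieves it.
    void BackwardParse(uint32_t block_start, uint32_t block_end,
                       std::vector<double>& cost_to_end, std::vector<Step>& best) {
        const uint32_t n = block_end - block_start;
        cost_to_end.assign(n + 1, 0.0);
        best.assign(n + 1, Step{1, 0});
    
        for (uint32_t i = n; i-- > 0; ) {
            const uint32_t pos = block_start + i;
            double best_cost = LiteralCost(pos) + cost_to_end[i + 1];   // literal is always possible
            Step best_step{1, 0};
    
            for (const Candidate& m : MatchesAt(pos)) {
                const uint32_t len = std::min(m.len, n - i);            // clamp to the block
                if (len < 2) continue;
                const double c = MatchCost(m.dist, len) + cost_to_end[i + len];
                if (c < best_cost) { best_cost = c; best_step = Step{len, m.dist}; }
            }
            cost_to_end[i] = best_cost;
            best[i] = best_step;
        }
        // The encoder then walks best[] forwards from i = 0, emitting literals and matches.
    }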

    There are two optimizations for semi-optimal parsing:
    - In the match finder (forward pass), if a sufficiently long match is found, the next position first tries to continue that match or reduce its length by one, and this ends when the length falls below an escape threshold, to avoid spending extra time in the binary-tree match finder.
    - In the path finder (reverse pass), at most 8 different lengths per match (a step size of (max length - min length) / 8) are tested, to avoid exhausting every option when a few lengths capture most of the gain.

    To speed up passing over incompressible data, if no matches are found for N bytes, then only every 2nd byte is tested, then every 4th, 8th and so on until another match is found, at which point the step size is decreased again. Speed drops off rapidly with larger dictionary sizes and semi-optimal parsing because of the exhaustive match finder.
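    A sketch of the skip logic (an illustrative reconstruction with made-up constants; the key point is that the step needs a cap, which is what went wrong with FlashMX.pdf at -window:16):
    Code:
    #include <algorithm>
    #include <cstdint>
    
    // Assumed helpers, not defined here:
    bool HasMatchAt(uint32_t pos);                      // probe the match finder
    uint32_t EmitMatch(uint32_t pos);                   // encode a match, return bytes consumed
    void EmitLiterals(uint32_t pos, uint32_t count);    // encode raw literals
    
    void EncodeWithSkip(uint32_t n) {
        uint32_t step = 1, miss_run = 0;
        const uint32_t MISS_LIMIT = 4096;               // made-up threshold
        const uint32_t MAX_STEP = 64;                   // cap so the step cannot blow up
    
        for (uint32_t pos = 0; pos < n; ) {
            if (HasMatchAt(pos)) {
                pos += EmitMatch(pos);
                miss_run = 0;
                step = std::max(step / 2, 1u);          // matching again: shrink the step
            } else {
                EmitLiterals(pos, std::min(step, n - pos));
                pos += step;
                if (++miss_run >= MISS_LIMIT) {         // long run of misses: probe less often
                    step = std::min(step * 2, MAX_STEP);
                    miss_run = 0;
                }
            }
        }
    }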

    Most of my target files are plain-text natural-language texts and structured binary data (sensor data, rows and hierarchies of floats, doubles, ints, repetitive strings and labels) between 300 KB and 200 MB, compressed once and decompressed often. The default window size is 1 MB, and default memory usage during decompression is not significantly more than that, because my target environment only has ~4-10 MB of RAM available for decompression and ~20-32 MB for compression. Memory for compression is 12N bytes (optimal) + IO overhead, or 17 MB (fast) + N + IO overhead.

    Disclaimer: Not significantly optimized for speed. Does not compress all strings of N bytes to N - 1 bytes, nor N / 2 bytes. Did not inspire Star Wars. Limited colour scheme options. If disassembled, please provide source with optimization and improvements.
    Attached Files
    Last edited by cade; 20th February 2020 at 18:19. Reason: Incorrect order of c and flags in help message, incompressible step size got too large with FlashMX.pdf, fixed bt4 bug


  #2: Mauro Vezzosi (Member, Italy)
    "nlzm c FlashMX.pdf FlashMX.pdf.nlzm" is ok, "nlzm -window:bits=... c FlashMX.pdf FlashMX.pdf.nlzm" crash for any window:bits (even for the default 20).
    ​We should write "[flags] c [input] [output]", not "c [flags] [input] [output]", right?

  #3: cade (Member, US)
    Oops, you're correct, typo in the help message. Done and updated.

    The correct syntax is:
    nlzm -window:27 c enwik8.txt out8

    or
    nlzm [flags] c [input] [output]

  #4: Mauro Vezzosi (Member, Italy)
    Sorry, you are correct: I wrote "-window:bits=...", not "-window:...".
    Now only "nlzm -window:16 c FlashMX.pdf FlashMX.pdf.nlzm" crashes.

  #5: cade (Member, US)
    Could you share that file with me, please? Is there a line number and error message? I'll test and fix. Does it work if you don't specify a window size? Do you have this crash with other files too?

  #6: Mauro Vezzosi (Member, Italy)
    > Could you share that file with me please
    Sorry, I assumed you knew, it's in the Maximum Compression corpus, downloadable here: https://www.maximumcompression.com/d...s/pdf-test.zip

    > is there a line number and error message?
    No, Windows goes into APPCRASH.
    Window bits: 16
    Calculating crc... 84E9D112
    State: 11 KB
    HR2: 12 KB
    HT3: 128 KB
    BT4: 1313 KB
    Window size: 133 KB
    Parser: 759 KB
    Working...
    The compressed file size is 425.984 (instead of > 3 MB).

    > It works if you don't specify window size?
    Yes, and it also works with -window from 17 to 30.

    > Do you have this crash with other files too?
    I tested the 10 files of the Maximum Compression corpus with -window from 16 to 30: only -window:16 with FlashMX.pdf crashes.


  #7: Sportman (Member, Planet Earth)
    enwik10:
    2,882,390,913 bytes, 760.516 sec. - 299.918 sec., nlzm -window:30 cf (compare ok)
    2,399,702,034 bytes, 3,394.103 sec. - 2,732.918 sec., nlzm -window:30 c (compare failed)


  #8: cade (Member, US)
    Thank you both for thorough testing, very helpful.

    Quote Originally Posted by Mauro Vezzosi View Post
    > Do you have this crash with other files too?
    I tested the 10 files of the Maximum Compression corpus with -window from 16 to 30: only -window:16 with FlashMX.pdf crashes.
    Found the problem: the step size for the incompressible check was getting too large and overflowed. It's now limited so it can't overflow; tested, and it should work now. Updated the link in the first post.

    Quote Originally Posted by Sportman View Post
    enwik10:
    2,882,390,913 bytes, 760.516 sec. - 299.918 sec., nlzm -window:30 cf (compare ok)
    2,399,702,034 bytes, 3,394.103 sec. - 2,732.918 sec., nlzm -window:30 c (compare failed)
    Do you have a link where I can download enwik10 (compressed), please?

  #9: Shelwien (Administrator, Kharkov, Ukraine)

  #10: cade (Member, US)
    Found and fixed a bug in how I slide BT4 when it crosses 2 GB (2 * window size + max match length + a good peek buffer for optimal parsing). I'll test enwik10 to completion, and with just BT4 + memcmp, and then upload a fixed version. This doesn't happen for window bits <= 29.
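    For reference, sliding such tables usually means rebasing every stored position by a delta once absolute positions approach the limit, roughly like this (illustrative, not the exact BT4 code):
    Code:
    #include <cstddef>
    #include <cstdint>
    
    // Rebase every stored position by delta; entries that fall behind the new
    // window start are flushed (0 is used as the "empty" marker here).
    void SlidePositions(uint32_t* table, size_t count, uint32_t delta) {
        for (size_t i = 0; i < count; ++i)
            table[i] = (table[i] >= delta) ? (table[i] - delta) : 0;
    }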

    For quick verification of other files, you can use the t command; it shows the expected CRC32 vs. the CRC32 of the result unpacked in memory. If there aren't any issues after that, I'll clean up and release the source into the public domain.

  #11: cade (Member, US)
    Quote Originally Posted by Sportman View Post
    enwik10:
    2,882,390,913 bytes, 760.516 sec. - 299.918 sec., nlzm -window:30 cf (compare ok)
    2,399,702,034 bytes, 3,394.103 sec. - 2,732.918 sec., nlzm -window:30 c (compare failed)
    Fixed and updated the exe.

    Also published source now since I did not find any other issues: https://github.com/nauful/NLZM


  #12: comp1 (Member, United States)
    Here's a Win32 exe that works on XP if anyone is interested.
    Attached Files

  #13: Sportman (Member, Planet Earth)
    Quote Originally Posted by cade View Post
    Fixed and updated the exe.
    Fix verified:

    enwik10:
    2,811,491,748 bytes, 858.957 sec. - 296.051 sec., nlzm -window:30 cf (compare ok)
    1,951,631,481 bytes, 24,754.394 sec. - 273.575 sec., nlzm -window:30 c (compare ok)

  #14: cade (Member, US)
    1.01 update:
    - Replaced the single range-coding stream with chunks of two-state rANS + a direct bit stream (~20-30% decompression speed improvement).
    - Chunks are 16 KB, ending after roughly 15 KB once direct bits counted + log2[rANS freq] approaches the limit.
    - Replaced bit trees with smaller CDF tables (2x4 for byte literals, 2x3 for 64 dist slots, etc.) (~30-50% decompression speed improvement).
    - SIMD for 3-bit and 4-bit CDF updates and look-ups (compile with -DUSE_SSE).
    - Branch-free counter (~5% decompression speed improvement); a sketch of the idea is below.
    - Limited BT4 match attempts to 128 per position.
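    The branch-free counter is the usual trick of picking the adaptation target with a mask instead of a branch, roughly (illustrative, not the exact counter used here):
    Code:
    // 12-bit probability of bit == 0, adaptation rate 1/32; a real coder would
    // also clamp p so it never reaches 0 or 4096.
    static inline int UpdateP(int p, int bit) {
        const int target = (bit - 1) & 4096;   // 4096 when bit == 0, 0 when bit == 1
        return p + ((target - p) >> 5);        // move toward the target, no branch
    }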

    If there are no issues, I'll clean up and update the source again in a little bit.

    Quote Originally Posted by comp1 View Post
    Here's a Win32 exe that works on XP if anyone is interested.
    Edit: If needed, the default 1 MB window size uses ~1.4 MB to decompress (probably could be less with a little effort). I can't spare much memory for my use cases.
    Attached Files


  #15: CompressMaster (Member, Slovakia)
    @cade
    Here are some ideas from Shelwien related to LZMA.
    Code:
    Anyway, imho there're still many ways to improve lzma (compression-wise).
    
    1. lzma parsing optimization is relatively good, but still far from perfect -
    only certain patterns like match-literal-rep_match are explicitly checked by the optimizer,
    while other ones may be missed.
    I think lzma really needs a matchfinder based on fuzzy matching - and explicit
    price estimation for these, instead of treating rep-matches like something independent.
    
    2. lzma already has 7 token types. Why not add some more useful ones?
    - delta matches, like in bsdiff
    - dictionary matches, where dictionary is not a part of the common window
    - support for multiple data windows (we can have different types of preprocessing in these)
    - multi-dimensional matches (for 2D+ and structured data)
    - LZ78 matches (by index (potentially multiple sets), implicit length)
    - LZP matches (order choice, explicit length)
    - ROLZ matches (contextual distance, explicit length)
    - PPM matches (no distance, unary coding of length, context tracking)
    - long-range matches (like from a rep preprocessor)
    
    3. lzma2 added block support, but there's no proper optimization for these.
    For example, its possible to change lc/lp/pb options per block, but no compressors
    make use of this.
    More ideas could be taken from zstd (RLE blocks, separation of literals/matches).
    Some more can be added after that (rep/MTF/PPM/BWT/huffman blocks, a flag for not adding the block to the window).
    
    4. Secondary LZ. I mainly noticed it when using bsdiff+lzma in an app update system,
    but in some kinds of data there're visible patterns in literals/matches after first LZ pass.
    (Bsdiff used delta-matches, so there were secondary matches within literal blocks too).
    For example, page tags in book1 are encoded by lzma as <match:len=4><literal><rep_match:len=2>,
    and only literal changes there.
    Btw, I'm just interested: when will optimal parsing be available?

  #16: Mauro Vezzosi (Member, Italy)
    Quote Originally Posted by cade View Post
    1.01 update: [...] If there are no issues, I'll clean up and update the source again in a little bit.
    I have tested all possible combinations of -window:, -posbits: and -ctx1: on the Maximum Compression corpus and all the decompressed files are identical to the originals.

  #17: CompressMaster (Member, Slovakia)
    Quote Originally Posted by Mauro Vezzosi View Post
    I have tested all possible combinations of -window:, -posbits: and -ctx1: on the Maximum Compression corpus and all the decompressed files are identical to the originals.
    If possible, could you post results for each file? Thanks.

  #18: cade (Member, US)
    Quote Originally Posted by CompressMaster View Post
    Btw, I'm just interested: when will optimal parsing be available?
    I don't follow the LZMA parser, which can be faster but, IIRC, doesn't calculate "optimal" paths through blocks. I use a lowest-cost parse from the fuzzy end of a block back to the start, then run through it forwards. My model is less effective in some cases where the last symbol or rep4 codes are used in normal LZMA.

    Normal mode c is (almost) optimal parsing, but I hesitate to call it properly optimal:
    1. Repeat matches are evaluated as short-cuts on the best path, not explicitly part of it, and are taken if better (i.e. cost_rep[here] + cost_to_end[pos after rep] < cost_to_end[here]); see the sketch at the end of this post.
    2. Costs are cached every 4-8k, while the probabilities are fully adaptive.
    3. Bit costs use a log2[freq] (previously log2[prob]) approximation, which is not mathematically optimal.

    https://encode.su/threads/2094-Optim...ll=1#post41746

    I have tried making this closer to optimal with three changes:
    1. Simulate probability adaptation at every step, then rewind and check the next position.
    2. Simulate encoding an 8k block via greedy parsing, copy the costs, run optimal parsing, simulate encoding with the new steps, copy the costs again, and repeat for 5 passes (1 greedy + 4 optimal with updated costs).
    3. Extend the fuzzy block end by 8k so blocks are parsed as 16k but only the best path for the first 8k is written. Fuzzy blocks currently end at 8k + 256.

    I don't think it was worth it: 3-5x slower than normal c for a 1-3% smaller size on my text files. Small tweaks in the bit-cost estimation, CDF update or counter update style (none of them mathematically optimal) easily shift the result by another 1-2% either way, and a fully optimal scheme would also have to treat those as dynamic.

    I should change the label from "semi-optimal" to "mostly optimal, but not according to a strictly mathematical definition".
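    To make point 1 above concrete, the rep short-cut during the forward walk looks roughly like this (a sketch reusing the cost_to_end[] / best[] idea from the backward pass; RepMatchLenAt, RepCost, EmitRep and EmitStep are hypothetical names):
    Code:
    #include <algorithm>
    #include <cstdint>
    #include <vector>
    
    struct Step { uint32_t len, dist; };
    
    // Assumed helpers, not defined here:
    uint32_t RepMatchLenAt(uint32_t pos);   // length of the rep0 match at pos (0 if none)
    double RepCost(uint32_t len);           // estimated cost of coding that rep match
    void EmitRep(uint32_t len);
    void EmitStep(const Step& s);
    
    void WalkForward(uint32_t block_start, uint32_t n,
                     const std::vector<double>& cost_to_end, const std::vector<Step>& best) {
        uint32_t i = 0;
        while (i < n) {
            const uint32_t pos = block_start + i;
            const uint32_t rep_len = std::min(RepMatchLenAt(pos), n - i);
            // Take the rep match as a short-cut if it beats the path's own estimate.
            if (rep_len >= 2 && RepCost(rep_len) + cost_to_end[i + rep_len] < cost_to_end[i]) {
                EmitRep(rep_len);
                i += rep_len;
            } else {
                EmitStep(best[i]);          // literal or match from the backward pass
                i += best[i].len;
            }
        }
    }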


  #19: Mauro Vezzosi (Member, Italy)
    Quote Originally Posted by CompressMaster View Post
    If possible, could you post results for each file? Thanks.
    Code:
    The best compressed files
       841.649 A10.jpg.nlzm
     1.497.997 AcroRd32.exe.nlzm
       820.496 english.dic.nlzm
     3.773.840 FlashMX.pdf.nlzm
       869.792 FP.LOG.nlzm
     1.884.443 MSO97.DLL.nlzm
       797.983 ohs.doc.nlzm
     1.045.998 rafale.bmp.nlzm
       669.326 vcfiu.hlp.nlzm
       586.012 world95.txt.nlzm
    12.787.536 Total

    I attached the lists of all files sorted by file name+window+posbits+ctx1 and by size in ascending order.
    There are also files for ctx1 = 0 and 1, but NLZM sets ctx1 to at least 2, so they are equal to ctx1 = 2.
    Attached Files
    Last edited by Mauro Vezzosi; 5th February 2020 at 19:52. Reason: Changed "posbits" to "ctx1"


  #20: cade (Member, US)
    I think you meant ctxbits were minimum 2. ctxbits were [2,8], posbits were [0,4]. By default, ctxbits are 2, posbits are 0.

    Here is a new exe that changes the ranges so ctxbits are now [0, 8] and posbits are [0, 4]. However, posbits + ctxbits cannot exceed 8 (the state would need too much memory during decompression). Decompression is also slightly faster.
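    Roughly, the literal context combines the low posbits of the position with the top ctxbits of the previous byte, which is why the number of contexts grows as 2^(posbits + ctxbits); the exact bit selection below is only illustrative:
    Code:
    #include <cstdint>
    
    // With posbits + ctxbits <= 8 there are at most 256 literal contexts.
    uint32_t LiteralContext(uint32_t pos, uint8_t prev_byte,
                            uint32_t posbits, uint32_t ctxbits) {
        const uint32_t pos_part = pos & ((1u << posbits) - 1u);
        const uint32_t ctx_part = (uint32_t)prev_byte >> (8 - ctxbits);
        return (pos_part << ctxbits) | ctx_part;
    }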
    Attached Files


  #21: Mauro Vezzosi (Member, Italy)
    Quote Originally Posted by cade View Post
    I think you meant ctxbits were minimum 2.
    Yes.

    I have tested the latest 1.01 version with all possible combinations of -window:, -posbits: and -ctx1: on the Maximum Compression corpus and all the decompressed files are identical to the originals.
    Code:
             c         cf
       841.649    850.027 A10.jpg.nlzm
     1.497.997  1.635.245 AcroRd32.exe.nlzm
       820.496  1.048.387 english.dic.nlzm
     3.773.840  3.826.534 FlashMX.pdf.nlzm
       869.792  1.051.547 FP.LOG.nlzm
     1.884.443  2.057.851 MSO97.DLL.nlzm
       797.983    887.023 ohs.doc.nlzm
     1.045.998  1.218.263 rafale.bmp.nlzm
       669.326    749.915 vcfiu.hlp.nlzm
       586.012    702.347 world95.txt.nlzm
    12.787.536 14.027.139 Total
    
    posbits = 0 always provides the best compression.
    I attached the size lists of all files compressed with "c" and "cf" sorted by file name+window+posbits+ctx1 and by size.
    Attached Files

  #22: cade (Member, US)
    Implementing a match finder for long matches at large distances, I thought of something based on Rabin-Karp, for either match finding or larger (read: more serious) deduplication:
    Given a rolling hash over 256 bytes, updating it needs only one add + multiply and one subtract + multiply (uint32 overflow can serve as the modulus). I can store this rolling hash every 256 bytes into an index at aligned offsets such as 0-255, 256-511, 512-767. Storing is as simple as start_offset[rolling_hash >> (32 - table_bit_size)] = 0, 256, 512. Shorter matches are missed, but far fewer entries are needed, and a table hit is a 256+ byte match.

    If at position 700 the rolling hash is found in the table with a start offset of 256, and [256, 511] matches [700, 955], then we've found a 256-byte match. It could also be extended in both directions, e.g. [680, 980], although depending on the parser it might be too late for the first 20 bytes. In practice the distances are much larger than 700 - 256.

    It's only useful if the maximum encodable length is more than 256 (about 4k in this system), and a 1 MB table (18 hash bits, 4-byte positions) seems to be enough for most large files (GB+). For larger matches at potentially longer distances, simply increasing the block size to 1k or 4k seems good. Missing blocks here and there matters less when the blocks that are found can be extended in both directions to cover the gaps.
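    A sketch of the idea (illustrative names and constants; the multiplier and table layout are not necessarily what the real RK256 uses):
    Code:
    #include <cstdint>
    #include <vector>
    
    // Polynomial rolling hash over 256-byte windows; uint32 overflow is the modulus.
    static const uint32_t B = 0x01000193u;              // illustrative multiplier
    static uint32_t POW255 = 0;                         // B^255 mod 2^32
    
    void InitRolling() { POW255 = 1; for (int i = 0; i < 255; ++i) POW255 *= B; }
    
    uint32_t HashBlock(const uint8_t* p) {              // hash of p[0..255]
        uint32_t h = 0;
        for (int i = 0; i < 256; ++i) h = h * B + p[i];
        return h;
    }
    
    uint32_t Roll(uint32_t h, const uint8_t* p) {       // drop p[0], add p[256]
        return (h - p[0] * POW255) * B + p[256];        // one sub + mul, one mul + add
    }
    
    // The index only remembers where aligned 256-byte blocks started.
    struct RK256Index {
        static const uint32_t BITS = 18;                // 2^18 entries * 4 bytes = 1 MB
        std::vector<uint32_t> start_offset;
    
        RK256Index() : start_offset(1u << BITS, 0xFFFFFFFFu) {}
        void Insert(uint32_t h, uint32_t block_start) { start_offset[h >> (32 - BITS)] = block_start; }
        uint32_t Lookup(uint32_t h) const { return start_offset[h >> (32 - BITS)]; }
    };
    // A hit is then verified with a memcmp over the 256 bytes and extended in both directions.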

  #23: cade (Member, US)
    Updated the source on GitHub and added nlzm102.exe to the first post.

    1.02 update:
    - Added a long-range match finder (RK256, Rabin-Karp with 256-byte indexing and a rolling matcher).
    - Simplified probability-coded decision sequences with CDFs.
    - Replaced the literal/match decision with commands (unlimited operation types instead of just two).
    - Improved cost estimation and the rep model.
    - Started on a forward-pass optimal parse, to eventually replace the reverse optimal parse, for better handling of stateful models (e.g. better rep, LZP, ROLZ).

    Also experimented with ROLZ, which improved compression by 2-10% on text files (24,127 for enwik8, and better on smaller files: only store an offset for individual literals or the start of a match, not every byte, with a forward optimal parse instead of the current reverse one), but it added too much overhead for now (memory and speed).

    The forward optimal parse tries to improve on LZMA's pattern-matching parser and the fast LZSS-style reverse parser by checking all commands while carrying state forward: https://github.com/nauful/NLZM/blob/...NLZM.cpp#L2716
    It is currently 0.5-1% better than the faster backwards parser, but not optimized at all yet.

    Next thoughts:
    - Continue the forward-pass optimal parser that can carry state through, such as the rep state (the reverse parser has static state cached every 8k block).
    - Dynamic range coding was a bad idea for speed; FSE chunks for each code type (cmd, literal, distance, etc.) would be faster. A simple o0 FSE test decompresses enwik8 at 440 MB/s with a branchless 24-bit refill over in-memory chunks and 2x 12-bit decodes per refill. rANS is a lot slower.


  #24: Shelwien (Administrator, Kharkov, Ukraine)
    > rANS is a lot slower

    I'd prefer if you didn't switch to static entropy coding.
    Sure, it greatly simplifies everything, and the compression difference seems to be small.
    But we already have codecs which evolved in that direction (oodle, brotli, zstd).
    And I think that adaptive coding is more interesting.

  #25: cade (Member, US)
    I'll experiment more with this forward optimal parser then; it is more flexible than the previous lowest-cost-from-end parser. I'll keep posting ideas as I think of them and try them.

    Due to long legal review processes where I work in the US, the only readily available projects are public domain, so much of what I write is out of necessity and not just a hobby. I'll make a branch for static entropy and larger chunks, but yes, the current system should be continued because it's a lot more interesting, especially with forward parsing. I'm not confident that static entropy will work nearly as well with multiple command types, e.g. 0 for literal, 1 for match, and further bits to indicate dictionary or rep.

  #26: compgt (Member, Philippines)
    Quote Originally Posted by cade View Post
    I'll experiment more with this forward optimal parser then; it is more flexible than the previous lowest-cost-from-end parser. I'll keep posting ideas as I think of them and try them.

    Due to long legal review processes where I work in the US, the only readily available projects are public domain, so much of what I write is out of necessity and not just a hobby. I'll make a branch for static entropy and larger chunks, but yes, the current system should be continued because it's a lot more interesting, especially with forward parsing. I'm not confident that static entropy will work nearly as well with multiple command types, e.g. 0 for literal, 1 for match, and further bits to indicate dictionary or rep.
    I see that LZ + ari is very powerful, but the extra bits for literal-vs-match decisions, match positions and match lengths all fall to the arithmetic coder.

    A prefix bit for literal vs. match is not needed. No match_length output is needed either for the other matches, nor literals_length, if it is done via my "reduced length lz". If you want, you can implement it.

    https://encode.su/threads/2923-orz-a...ll=1#post63540

    You maintain a write buffer for decoding, to be flushed later to the output file. The surprising technique is to output the literals last in the write buffer, after decoding the match strings. I have not tried or coded this yet, but it is clearly a very compact LZ.

  #27: cade (Member, US)
    You would still have to indicate how many literals to skip, so 0/1 is replaced with a counter of skipped literals + next match info. Given that results are mostly LZ matches (>50-95+%), it might be better to entropy code 4 (or more?) decisions at once (i.e. alphabet of 0000 to 1111). Decoding speed would also be improved (1 symbol contains 4 or more flags).
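    As a toy illustration of packing several decisions into one coded symbol (not from the current code):
    Code:
    #include <cstdint>
    
    // Pack four literal/match flags into one symbol in [0, 15]; the entropy coder
    // then works with a 16-symbol alphabet instead of four separate binary decisions.
    uint32_t PackFlags(bool f0, bool f1, bool f2, bool f3) {
        return (uint32_t)f0 | ((uint32_t)f1 << 1) | ((uint32_t)f2 << 2) | ((uint32_t)f3 << 3);
    }
    
    void UnpackFlags(uint32_t sym, bool out[4]) {
        for (int i = 0; i < 4; ++i)
            out[i] = ((sym >> i) & 1u) != 0;   // one decoded symbol yields four flags
    }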

  #28: compgt (Member, Philippines)
    Quote Originally Posted by cade View Post
    You would still have to indicate how many literals to skip, so 0/1 is replaced with a counter of skipped literals + next match info. Given that results are mostly LZ matches (>50-95+%), it might be better to entropy code 4 (or more?) decisions at once (i.e. alphabet of 0000 to 1111). Decoding speed would also be improved (1 symbol contains 4 or more flags).
    There is no need for how many literals to skip; the match strings are written by pos in the write buffer, so they fall into their correct positions in the write buffer or output file.

    > You would still have to indicate how many literals to skip, so 0/1 is replaced with a counter of skipped literals + next match info.

    You are thinking of a write buffer for decoding. Maybe what you mean by "how many literals to skip" are actually match strings here, already written before the literals. As I said, the literals nicely fit into the "holes" or gaps not covered by the match strings. Counters of skipped literals are not needed; just look at the write buffer.

    So during encoding, you can choose to output all the literals as just one array of characters, maybe at the end of the output file, or, for block-based encoding (which is more practical for coding shorter match positions), at the end of the block.

  #29: cade (Member, US)
    How do you transmit the size of the holes between matches, or how do you transmit positions for match strings?

  #30: compgt (Member, Philippines)
    Quote Originally Posted by cade View Post
    How do you transmit the size of the holes between matches, or how do you transmit positions for match strings?
    > How do you transmit the size of the holes between matches,

    There are no hole sizes to transmit; there are just characters written into the holes. Whatever the sizes of those holes (1 char, 2 chars, or n chars), the literals are just written one by one into the holes.

    Decode Output buffer (one appropriately-sized block):

    [STRING..STRING....STRING.STRING.....STRING.STRING]
    (2 holes, 4 holes, 1 hole, 5 holes, 1 hole)


    > how do you transmit positions for match strings?

    You transmit them the same way. They are positions in the write buffer or block; they are the correct positions of the strings in the block, and that's why the strings are decodable.

    Note that for one duplicated string you need to encode the initial string in this algorithm, and then go on to encode the similar strings using only pos, without match_len. Then find other strings that are duplicated in the file/block and encode them.
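    An untested sketch of the decoder, just to make the idea concrete (Match and DecodeBlock are made-up names; it assumes the source region of each match is already fully written when that match is copied):
    Code:
    #include <cstdint>
    #include <vector>
    
    struct Match { uint32_t pos, len, src; };   // destination position, length, source position
    
    // First place the match strings at their transmitted positions, then pour the
    // literal stream into the remaining holes, one literal per uncovered byte.
    std::vector<uint8_t> DecodeBlock(uint32_t block_size,
                                     const std::vector<Match>& matches,
                                     const std::vector<uint8_t>& literals) {
        std::vector<uint8_t> out(block_size, 0);
        std::vector<bool> covered(block_size, false);
    
        for (const Match& m : matches) {
            for (uint32_t i = 0; i < m.len; ++i) {      // byte-by-byte copy allows overlap
                out[m.pos + i] = out[m.src + i];
                covered[m.pos + i] = true;
            }
        }
    
        size_t lit = 0;
        for (uint32_t i = 0; i < block_size; ++i)
            if (!covered[i]) out[i] = literals[lit++];  // literals fall into the holes in order
    
        return out;
    }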
    Last edited by compgt; 12th February 2020 at 14:51.
