Results 1 to 13 of 13

Thread: I have no luck with additional compression TS40.txt...

  1. #1
    Member lz77's Avatar
    Join Date
    Jan 2016
    Location
    Russia
    Posts
    100
    Thanks
    28
    Thanked 13 Times in 9 Posts

    I have no luck with additional compression TS40.txt...

    My simple kids LZ77 compressor is faster than lzturbo -22 but compresses the file TS40.txt better. Also I wrote simple preprocessor for this file. Preprocessor + fast LZ77 compress TS40.txt up to 143 Mb. But I want to get final size at least 120 Mb. I tried to use static Huffman for additional compression: I've gathered 8 upper bits from each offset in one file (62 Mb), then I compressed it with the Huffman codec. After that I expected to get at least 35 Mb, but got 60 Mb! I'm surprised: lzturbo without any preprocessor compresses TS40.txt up to 125 Mb. How and what lzturbo compresses after LZ77? Huffman compression of literals will not give much because the size of literals is 4.2 Mb. Does tANS/rANS compress much better than static Huffman? Don't believe in it...
    Last edited by lz77; 1st August 2020 at 16:08.

  2. #2
    Member
    Join Date
    May 2020
    Location
    Berlin
    Posts
    38
    Thanks
    8
    Thanked 10 Times in 7 Posts
    I think the best way to see potential optimizations is to gather a lot of statistics (distribution and frequency of matches for len 1 to max len, match pos, literal run lengths, match run lengths etc.).

  3. #3
    Member lz77's Avatar
    Join Date
    Jan 2016
    Location
    Russia
    Posts
    100
    Thanks
    28
    Thanked 13 Times in 9 Posts
    Hm... I doubt that Huffman compression literal/match counters as well as prefixes can give something...
    Yes, match length = 4 is very common, but what does it do?
    By the way: what's compressed after LZ4 in zstd: offsets or counters?

    Ich bin verwirrt...

  4. #4
    Member lz77's Avatar
    Join Date
    Jan 2016
    Location
    Russia
    Posts
    100
    Thanks
    28
    Thanked 13 Times in 9 Posts
    lzturbo -22 -b450 (LZ77 only) compresses TS40.txt up to 167.9 Mb.
    lzturbo -32 -b450 (LZ77+"Asymmetric Numeral System TurboANX" (I think, should be TurboANS...)) compresses it up to 125 Mb. How lzturbo with TurboANS decreases compressed size on 43 Mb and spends 1 sec. on it??

    May be lzturbo -32 is not LZ77+TurboANS but something other than LZ77+TurboANS?

  5. #5
    Member
    Join Date
    May 2020
    Location
    Berlin
    Posts
    38
    Thanks
    8
    Thanked 10 Times in 7 Posts
    Quote Originally Posted by lz77 View Post
    Hm... I doubt that Huffman compression literal/match counters as well as prefixes can give something...
    Yes, match length = 4 is very common, but what does it do?
    By the way: what's compressed after LZ4 in zstd: offsets or counters?

    Ich bin verwirrt...
    Mne tozhe!

    Well I thought along the lines of looking at stats for lengths (e.g. 4624 len 2 matches, 3511 len 3 matches...), distances, literal run lengths between matches (0 to n bytes).
    For the len/offset relationship I used a matrix with either linear scale (e.g. for lengths) or binary (log2) scale for distances.

    And either by applying the encoding somehow to the stats or calculating the costs for encoding and providing stats for it, will also show, where the encoding might cost too many bits.

    This might really help getting ideas for improvements. Did you look at the probabilities of getting repeated offsets (same offset as one of the last n)?

    Re LZ4/Zstd: I don't know. Maybe the details described in https://tools.ietf.org/html/rfc8478 will help.

  6. #6
    Member lz77's Avatar
    Join Date
    Jan 2016
    Location
    Russia
    Posts
    100
    Thanks
    28
    Thanked 13 Times in 9 Posts
    Unfortunately, I'm not an English speaker, I learned German (long ago...) I understand not all the LZ77 compression improvements yet.

    I see no relation between len/offset. Also I see no repeated offsets when compressing enwik8. I see only repeated match strings like 'the ', ', and ', 'and ', 'of the ', ...
    If I want to use history of matched substrings, I will need one more prefix for these codewords? For example, if I'm using 4 prefixes 00-11 for 4 length of offset, I will have to go to 3 bit prefixes. Won't those 3-bit prefixes eat up the benefits of using them?

    I'm using a minimum match length of 4 bytes as LZ4. How could you use the match length of 2 and 3 bytes?

    I'm going to use some ideas from the thread "Reduced Length LZ (RLLZ): One way to output LZ77 codes". I doubt that it gives a big win, but after using this idea, the compression will no longer be so fast...

  7. #7
    Member
    Join Date
    May 2020
    Location
    Berlin
    Posts
    38
    Thanks
    8
    Thanked 10 Times in 7 Posts
    Quote Originally Posted by lz77 View Post
    Unfortunately, I'm not an English speaker, I learned German (long ago...) I understand not all the LZ77 compression improvements yet.
    I think, this is a nice opportunity to learn something new. When I did my research of current LZ versions and probably forgotten ones from the past, I found a lot of interesting ideas!

    Quote Originally Posted by lz77 View Post
    I see no relation between len/offset. Also I see no repeated offsets when compressing enwik8. I see only repeated match strings like 'the ', ', and ', 'and ', 'of the ', ...
    If I want to use history of matched substrings, I will need one more prefix for these codewords? For example, if I'm using 4 prefixes 00-11 for 4 length of offset, I will have to go to 3 bit prefixes. Won't those 3-bit prefixes eat up the benefits of using them?
    There is no strong relation for specific values (e.g. offset 360, len 5), but longer matches usually are further away (see links below), as the probability to find a combination of letters (or other symbols) shrinks with the length.

    Instead of using another prefix bit, you might redistribute your current offset ranges among these 4 subdivisions. Or really create a variable length prefix tree (00, 01, 10, 110, 111 or sth. else, costing only additional bits for the biggest offsets, which most likely involve longer match lengths, being more seldom but saving lots of bytes already).

    For such things I used detailed stats in Excel to play around with.

    Quote Originally Posted by lz77 View Post
    I'm using a minimum match length of 4 bytes as LZ4. How could you use the match length of 2 and 3 bytes?
    I found them useful for smaller binary files. It just depends. Large texts could contain more of the longer matches, of course. But it's certainly not wrong to look at the frequency of shorter lenghts (with shorter offsets, since a 2 byte match with all flags etc. should be smaller than 2 literals). These two pages give some interesting insights based on an (also linked) "FV" tool:
    http://mattmahoney.net/dc/textdata.html
    http://www.fantascienza.net/leonardo...tatistics.html (which is linked from Matt's page)

    Quote Originally Posted by lz77 View Post
    I'm going to use some ideas from the thread "Reduced Length LZ (RLLZ): One way to output LZ77 codes". I doubt that it gives a big win, but after using this idea, the compression will no longer be so fast...
    This would at least be interesting to see its capabilities. I made some notes while playing it through in my mind, adding/changing some bits of that algo already at that level. I'll send you the bullet point list via PM.

  8. Thanks:

    lz77 (5th August 2020)

  9. #8
    Member lz77's Avatar
    Join Date
    Jan 2016
    Location
    Russia
    Posts
    100
    Thanks
    28
    Thanked 13 Times in 9 Posts
    Thanks, but I haven't figured out how to use this to improve my compressor yet. All the more so for winning 3000 euros.
    For example, if I exclude literal length bits from my code words, I will improve ratio on ~1.5%...

    I've tried to use 00, 01, 10, 110, 111 prefixes, but it made the compression worse.

    At this time my baby LZ77 compressor without compiler optimizations compresses a bit better and 20% faster than lzturbo -p0 -22 -b1024... on I3 5005U. May be on newer CPUs the comparison results will be different...

  10. #9
    Member
    Join Date
    May 2020
    Location
    Berlin
    Posts
    38
    Thanks
    8
    Thanked 10 Times in 7 Posts
    Quote Originally Posted by lz77 View Post
    Thanks, but I haven't figured out how to use this to improve my compressor yet. All the more so for winning 3000 euros.
    For example, if I exclude literal length bits from my code words, I will improve ratio on ~1.5%...

    I've tried to use 00, 01, 10, 110, 111 prefixes, but it made the compression worse.

    At this time my baby LZ77 compressor without compiler optimizations compresses a bit better and 20% faster than lzturbo -p0 -22 -b1024... on I3 5005U. May be on newer CPUs the comparison results will be different...
    For RLLZ you might have to rewrite and rebalance everything.

    You already have good results. At this stage you might turn it into an optimization problem with parameter tuning.
    Last edited by Dresdenboy; Yesterday at 11:38. Reason: typo

  11. #10
    Member lz77's Avatar
    Join Date
    Jan 2016
    Location
    Russia
    Posts
    100
    Thanks
    28
    Thanked 13 Times in 9 Posts
    I still can't figure out how to decompress LZT data if there are no literal counters...And I want to compress TS40.txt to at least 120 Mb in 10 seconds. But LZ77 only will not be able to do this.

  12. #11
    Member
    Join Date
    May 2020
    Location
    Berlin
    Posts
    38
    Thanks
    8
    Thanked 10 Times in 7 Posts
    Quote Originally Posted by lz77 View Post
    I still can't figure out how to decompress LZT data if there are no literal counters...And I want to compress TS40.txt to at least 120 Mb in 10 seconds. But LZ77 only will not be able to do this.
    As comgt described, you have a literals data block (in his variant only containing literals, which weren't matched) and the match strings (literal string + positions to copy it to) in some list. Example:

    Code:
    Original data: 
    ABRACADABRA12AD
    0123456789abcde (position)
    Match strings:
    ABRA - copy to pos 0 and 7 (ABRA...ABRA....)
    AD - copy to pos 5 and d (ABRA.ADABRA..AD)
    Literals left over:
    C12 - copy into remaining spots "."
    Re LZ77: It might be difficult to get there. With just over 4 MB literals left, the length < 4 matches can't remove more than that (and cost encoding bits).

    So the most surely could be gained from the encoding.

  13. #12
    Member lz77's Avatar
    Join Date
    Jan 2016
    Location
    Russia
    Posts
    100
    Thanks
    28
    Thanked 13 Times in 9 Posts
    But compressed data must contain positions 0 and 7, 5 and d? How can they take fewer bits than literal counters 2-4 bits long?

    And how I can use offset history buffer? Among 256 offsets will not be the same ones. And their high bytes are unlikely to be equal...

  14. #13
    Member
    Join Date
    May 2020
    Location
    Berlin
    Posts
    38
    Thanks
    8
    Thanked 10 Times in 7 Posts
    Quote Originally Posted by lz77 View Post
    But compressed data must contain positions 0 and 7, 5 and d? How can they take fewer bits than literal counters 2-4 bits long?

    And how I can use offset history buffer? Among 256 offsets will not be the same ones. And their high bytes are unlikely to be equal...
    I think it might just be better than current LZ77-family-like algorithms (LZB etc.) in specific situations, e.g. with a high percentage of literal bytes and still multiple reuses for matched strings (so the savings for matchs length get bigger), or just a high number of reused strings overall (texts). But this remains to be seen.

    Possible difficulties in staying efficient come from overlapped matches (like ABRA and BRA, see my notes for leaving match strings in the literals block).

Similar Threads

  1. Replies: 4
    Last Post: 25th October 2018, 00:31
  2. NLP and compression of TXT files
    By BetaTester in forum Data Compression
    Replies: 0
    Last Post: 13th June 2012, 22:34
  3. APPNOTE.TXT
    By ggf31416 in forum Forum Archive
    Replies: 0
    Last Post: 30th September 2006, 16:04

Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •