
Thread: I have no luck with additional compression TS40.txt...

  1. #31
    Member lz77's Avatar
    Join Date
    Jan 2016
    Location
    Russia
    Posts
    125
    Thanks
    36
    Thanked 13 Times in 9 Posts
    Hm, I tried to compress offsets.txt from my archive above with lzpm and lzturbo -32 (-39). Both refused to compress it and just added their own header to the file...

    lzpm: 10000 bytes -> 10636 bytes
    lzturbo: 10000 bytes -> 10033 bytes.

    But where is ARI/tANS?

  2. #32
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,964
    Thanks
    295
    Thanked 1,297 Times in 735 Posts
    Well, the low bytes of these offsets (if you mean that, since they don't fit in a byte) do seem pretty random.
    CM (paq8px, nzcc) does compress them to 9950 bytes or so, but certainly not plain "ARI/tANS".

    The usual trick is parsing optimization, though: there are usually multiple candidates
    for a match (multiple instances of a word, etc.), so you can choose the one that is most compressible in the context of the others.
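
    A minimal sketch of that candidate choice (the cost model here is hypothetical, not from any particular codec): among match candidates of equal length, prefer the one whose distance is cheapest to code given the previous match.

    Code:
    #include <stddef.h>

    /* Crude bit-cost of coding a distance: roughly log2(d) bits. */
    static unsigned dist_bits(unsigned d)
    {
        unsigned n = 1;
        while (d >>= 1) n++;
        return n;
    }

    /* Among n candidate distances of equal match length, pick the one
       that is most compressible in the context of the previous match:
       a repeat of prev_dist is nearly free, anything else pays about
       log2(distance) bits. */
    static size_t pick_candidate(const unsigned *dist, size_t n,
                                 unsigned prev_dist)
    {
        size_t best = 0;
        unsigned best_cost = (unsigned)-1;
        for (size_t i = 0; i < n; i++) {
            unsigned cost = (dist[i] == prev_dist) ? 1 : dist_bits(dist[i]);
            if (cost < best_cost) { best_cost = cost; best = i; }
        }
        return best;
    }

    A real parsing optimizer would also account for how each choice affects later matches; the greedy heuristic above only looks backward.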

  3. #33
    Member lz77's Avatar
    Join Date
    Jan 2016
    Location
    Russia
    Posts
    125
    Thanks
    36
    Thanked 13 Times in 9 Posts
    It's not low bytes, it's 10000 9-bit offsets (from offset slot № 1).
    For rapid compression (40 sec. to compress+decompress 1 GB; 1 sec. == 1 MB), parsing optimization is not suitable; I'm using 5 hash tables instead...
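
    (A sketch of one common multi-table arrangement, purely an assumption since the actual scheme isn't described here: one hash table per match length, probed longest-first, so no parse search is needed.)

    Code:
    #include <stdint.h>
    #include <string.h>

    #define TBITS 16                    /* 2^16 cells per table (assumed) */

    static uint32_t tab[5][1 << TBITS]; /* tab[k] indexes strings of length 4+k */

    static uint32_t hash_n(const uint8_t *p, int n)
    {
        uint32_t h = 0;
        for (int i = 0; i < n; i++) h = h * 2654435761u + p[i];
        return h >> (32 - TBITS);
    }

    /* Greedy lookup: try the longest hashed length first and take the
       first verified match; position 0 is reserved as "empty". */
    static int find_match(const uint8_t *buf, uint32_t cur, uint32_t *mpos)
    {
        for (int len = 8; len >= 4; len--) {
            uint32_t h = hash_n(buf + cur, len);
            uint32_t p = tab[len - 4][h];
            tab[len - 4][h] = cur;               /* update while probing */
            if (p && !memcmp(buf + p, buf + cur, len)) {
                *mpos = p;
                return len;                      /* match of this length */
            }
        }
        return 0;                                /* emit a literal */
    }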

  4. #34
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,964
    Thanks
    295
    Thanked 1,297 Times in 735 Posts
    > parsing optimization is not suitable,

    Maybe offline optimization of a heuristic parser?
    I mean, see what slow parsing optimizer outputs and how to predict its behavior from hash matches and input data?
    E.g. LZMA's a0 mode is an example of that, and even a1 is not 100% brute force.

    > I'm using 5 hash tables instead...

    If you need fast encoding anyway, maybe try ROLZ?

  5. #35
    Member lz77's Avatar
    Join Date
    Jan 2016
    Location
    Russia
    Posts
    125
    Thanks
    36
    Thanked 13 Times in 9 Posts
    Regarding ROLZ, I've only seen the idea described at ru.wikipedia.org/wiki/ROLZ.

  6. #36
    Member
    Join Date
    May 2020
    Location
    Berlin
    Posts
    64
    Thanks
    10
    Thanked 20 Times in 15 Posts
    Quote Originally Posted by lz77
    Regarding ROLZ, I've only seen the idea described at ru.wikipedia.org/wiki/ROLZ.
    English Wikipedia removed the article because it didn't cite well-known sources. But that was years ago, and it might be put up again, as there are more references now.

  7. #37
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,964
    Thanks
    295
    Thanked 1,297 Times in 735 Posts
    http://mattmahoney.net/dc/dce.html#Section_525

    Basically, use the offset within a hashtable cell instead of the global distance.
    The same has to be done during decoding, so decoding is slower.
    But compression is better.
    It's something like an intermediate step from LZ to PPM.

    Christian's RZ is ROLZ.
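
    A minimal sketch of that (sizes and names are illustrative, not Christian's RZ): each context hash owns a small ring of recent positions, a match is coded as an index into that ring, and the decoder has to maintain the same table, which is exactly why decoding gets slower.

    Code:
    #include <stdint.h>

    #define CBITS 16                  /* context hash bits (illustrative) */
    #define SLOTS 16                  /* positions remembered per context */

    static uint32_t pos[1 << CBITS][SLOTS]; /* recent positions per context */
    static uint8_t  head[1 << CBITS];       /* ring write cursors */

    static uint32_t ctx_hash(const uint8_t *p)     /* order-2 context */
    {
        return (p[-1] | (uint32_t)p[-2] << 8) & ((1u << CBITS) - 1);
    }

    /* Encoder: return the ring index (0..SLOTS-1) of a previous position
       seen in this context, or -1 for a literal. Only this small index
       is transmitted; the decoder recovers the real position from its
       own copy of the table. Real code would verify a full match length. */
    static int find_slot(const uint8_t *buf, uint32_t cur)
    {
        uint32_t h = ctx_hash(buf + cur);
        for (int i = 0; i < SLOTS; i++) {
            uint32_t p = pos[h][(uint32_t)(head[h] - 1 - i) & (SLOTS - 1)];
            if (p && buf[p] == buf[cur])
                return i;
        }
        return -1;
    }

    static void update(const uint8_t *buf, uint32_t cur) /* both sides run this */
    {
        uint32_t h = ctx_hash(buf + cur);
        pos[h][head[h] & (SLOTS - 1)] = cur;
        head[h]++;
    }

    A 4-bit ring index replaces a global distance that could need 20+ bits, which is where the compression gain comes from.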

  8. #38
    Member
    Join Date
    Jan 2014
    Location
    USA
    Posts
    6
    Thanks
    10
    Thanked 3 Times in 2 Posts
    English Wikipedia contains a reference to "QUAD / BALZ / BCM (highly-efficient ROLZ-based compressors)" on the archiver comparison page, where it discusses PeaZip under the heading "Uncommon archive format support". But the tooltip shown when hovering over "ROLZ" indicates the page is still missing: "ROLZ (Page does not exist)".

  9. #39
    Member lz77's Avatar
    Join Date
    Jan 2016
    Location
    Russia
    Posts
    125
    Thanks
    36
    Thanked 13 Times in 9 Posts
    Quote Originally Posted by Shelwien
    The same has to be done during decoding, so decoding is slower.
    But compression is better.
    Decoding is slower, but in the scoring formula (F = c_time + 2*d_time + ...) the decode time is multiplied by 2, so extra decode time is penalized twice as heavily: e.g. saving 3 sec. of compression time while losing 2 sec. of decode time changes F by -3 + 2*2 = +1, a net loss...

    As I understand ROLZ: if at the current position we have 'the ' and the preceding symbol is 'n', then we look in a history buffer at 256 saved positions for 'the ', but only those whose preceding symbol is 'n'? And if the search fails, then 't' is coded as a literal? Are there too many literals?

    Is LZP also like ROLZ?

  10. #40
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,964
    Thanks
    295
    Thanked 1,297 Times in 735 Posts
    > the decode time is multiplied by 2...

    Yes, ROLZ can provide much better compression with a fast parsing strategy,
    which might be good for the competition.

    > Are there too many literals?

    Mostly the same as with normal LZ77.
    LZ77 would usually already work like that - take a context hash and
    go through a hash-chain list to check previous context matches -
    the difference is that LZ77 then encodes the match distance,
    while ROLZ encodes the number of hash-chain steps.
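
    To put the difference in one place (the structures are hypothetical): both coders walk the same chain, they just transmit different numbers.

    Code:
    #include <stdint.h>

    #define NIL 0xFFFFFFFFu

    /* Chain of previous positions that shared one context hash. */
    typedef struct { uint32_t position; uint32_t next; } Node;

    /* Walks the chain looking for a verified match at 'cur'. On success,
       an LZ77 coder transmits the global distance (large), while a ROLZ
       coder transmits the step count (small). */
    static int walk_chain(const Node *nodes, uint32_t first,
                          const uint8_t *buf, uint32_t cur,
                          uint32_t *lz77_dist, uint32_t *rolz_steps)
    {
        uint32_t steps = 0;
        for (uint32_t n = first; n != NIL; n = nodes[n].next, steps++) {
            if (buf[nodes[n].position] == buf[cur]) { /* real code checks length */
                *lz77_dist  = cur - nodes[n].position; /* e.g. 1048576 */
                *rolz_steps = steps;                   /* e.g. 3       */
                return 1;
            }
        }
        return 0;
    }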

    > Is LZP also like ROLZ?

    LZP is a special case of ROLZ with only one match per context hash.
    So it encodes only length or literals, no distance equivalent.

    But LZP is rarely practical on its own - it's commonly used as a dedup preprocessor
    for some stronger algorithm.
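
    And the LZP degenerate case for contrast (same illustrative sizes as above): one remembered position per context, so only match lengths and literals are ever coded.

    Code:
    #include <stdint.h>

    #define CBITS 16

    static uint32_t last_pos[1 << CBITS];  /* ONE position per context hash */

    static uint32_t lzp_ctx(const uint8_t *p)      /* order-2 context */
    {
        return (p[-1] | (uint32_t)p[-2] << 8) & ((1u << CBITS) - 1);
    }

    /* LZP step: the context predicts exactly one candidate position.
       Returns the match length there (0 means code a literal); no
       distance or slot index is transmitted at all. */
    static uint32_t lzp_match(const uint8_t *buf, uint32_t cur, uint32_t end)
    {
        uint32_t h = lzp_ctx(buf + cur);
        uint32_t p = last_pos[h];
        last_pos[h] = cur;                 /* decoder updates identically */
        uint32_t len = 0;
        while (p && cur + len < end && buf[p + len] == buf[cur + len])
            len++;
        return len;
    }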

  11. #41
    Member lz77's Avatar
    Join Date
    Jan 2016
    Location
    Russia
    Posts
    125
    Thanks
    36
    Thanked 13 Times in 9 Posts
    After preprocessing TS40.txt with my preprocessor, my compressor compresses it by 16 MB better. Isn't that great?

    Yesterday I had two thoughts about ROLZ:

    1. The compressor can calculate hashes using bytes to the right of the current position (for example, using 'cdef' in abcd|efgh), but the decompressor can't: abcd|????. Are matches near the current position impossible in ROLZ?

    2. At the beginning of the data, classic LZ can compress 'abcd' but ROLZ can't: abcdEabcdFabcd...Eabcd... The first appearance of 'abcd' has no predecessor char, the second doesn't have the right one, and only the second 'Eabcd' will match...

    Did I understand correctly that these are disadvantages of the ROLZ algorithm?
