
Thread: I have no luck with additional compression TS40.txt...

  1. #31
    Member lz77
    Hm, I tried to compress offsets.txt from my archive above with lzpm & lzturbo -32 (-39). Both refused to compress and just added their own header to the file...

    lzpm: 10000 bytes -> 10636 bytes
    lzturbo: 10000 bytes -> 10033 bytes.

    But where is ARI/tANS?

  2. #32
    Administrator Shelwien
    Well, low bytes of these offsets (if you mean that, since they don't fit in a byte) do seem pretty random.
    CM (paq8px,nzcc) does compress them to 9950 or so, but certainly not plain "ARI/tANS".

    The usual trick is parsing optimization though - there are usually multiple candidates for a match
    (multiple instances of a word, etc.), so you can choose the one that is most compressible in the context of the others.
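
    To make this concrete, here is a minimal sketch (not from any particular codec; the price model and its constants are made up for illustration) of choosing among candidate matches by estimated coding cost:
    Code:
    // Hypothetical cost-based match selection: among candidates found by the
    // matchfinder, keep the one that is cheapest per byte it covers.
    #include <cstdint>
    #include <cmath>
    #include <vector>

    struct Match { uint32_t pos; uint32_t len; };   // len > 0 assumed

    // Rough price in bits of a (len, dist) pair; a real encoder would price
    // against the actual state of its entropy coder instead.
    static double priceBits(uint32_t len, uint32_t dist) {
        return 6.0 /* length code */ + 1.0 + std::log2((double)dist + 1);
    }

    // Pick the candidate with the lowest price per covered byte at 'cur'.
    static const Match* pickCheapest(const std::vector<Match>& cands, uint32_t cur) {
        const Match* best = nullptr;
        double bestCost = 1e30;
        for (const Match& m : cands) {
            double cost = priceBits(m.len, cur - m.pos) / m.len;
            if (cost < bestCost) { bestCost = cost; best = &m; }
        }
        return best;
    }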

  3. #33
    Member lz77
    It's not low bytes, it's 10,000 9-bit offsets (from offset slot No. 1).
    For Rapid compression (40 sec. for compress+decompress of 1 GB, 1 sec. == 1 MB) parsing optimization is not suitable; I'm using 5 hash tables instead...
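
    (For reference, a hedged sketch of what such a multi-table matchfinder might look like; the table count, key lengths and sizes below are illustrative guesses, not lz77's actual code:)
    Code:
    // Guessed sketch of a multi-hash-table matchfinder: several tables keyed
    // on prefixes of different lengths, longest key probed first.
    #include <cstdint>
    #include <cstring>

    static const int kTables = 5;
    static const int kBits   = 16;              // 64K cells per table
    static uint32_t tab[kTables][1 << kBits];   // cell = last position seen (0 = empty)

    static uint32_t hashBytes(const uint8_t* p, int n) {  // n <= 8
        uint64_t x = 0;
        std::memcpy(&x, p, n);
        return (uint32_t)((x * 0x9E3779B97F4A7C15ull) >> (64 - kBits));
    }

    // Probe keys of length 8,7,6,5,4; a hit on a longer key usually means a
    // longer match. Caller must guarantee 8 readable bytes at buf+pos.
    static uint32_t findCandidate(const uint8_t* buf, uint32_t pos) {
        for (int t = kTables - 1; t >= 0; --t) {
            int keyLen = 4 + t;
            uint32_t h = hashBytes(buf + pos, keyLen);
            uint32_t prev = tab[t][h];
            tab[t][h] = pos;
            if (prev && std::memcmp(buf + prev, buf + pos, keyLen) == 0)
                return prev;                    // verified match position
        }
        return 0;                               // no candidate: emit a literal
    }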

  4. #34
    Administrator Shelwien
    > parsing optimization is not suitable,

    Maybe offline optimization of a heuristic parser?
    I mean, look at what a slow parsing optimizer outputs and learn to predict its behavior from hash matches and the input data.
    E.g. LZMA's a0 mode is an example of that, and even a1 is not 100% brute force.

    > I'm using 5 hash tables instead...

    If you need fast encoding anyway, maybe try ROLZ?

  5. #35
    Member lz77
    Regarding ROLZ, the only description I've seen is the idea at ru.wikipedia.org/wiki/ROLZ.

  6. #36
    Member (Berlin)
    Quote Originally Posted by lz77 View Post
    Regarding ROLZ, the only description I've seen is the idea at ru.wikipedia.org/wiki/ROLZ.
    English Wikipedia removed the article because it didn't cite well-known sources. But that was years ago, and it might be put up again now that there are more references.

  7. #37
    Administrator Shelwien
    http://mattmahoney.net/dc/dce.html#Section_525

    Basically, use the offset within a hashtable cell instead of the global distance.
    The same has to be done during decoding, so decoding is slower,
    but compression is better.
    It's something like an intermediate step from LZ to PPM.

    Christian's RZ is ROLZ.
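
    A minimal sketch of that idea (illustrative only; table sizes are arbitrary, length extension past minLen is omitted, and this is not how RZ itself is implemented):
    Code:
    // ROLZ-style matchfinder sketch: each context keeps a small ring of recent
    // positions, and a match is coded as a rank into that ring, not a distance.
    #include <cstdint>

    static const int kCtxBits = 16;     // context = hash of preceding bytes
    static const int kSlots   = 32;     // positions remembered per context
    struct CtxRing { uint32_t pos[kSlots]; uint8_t head; };
    static CtxRing rings[1 << kCtxBits];   // ~8 MB of tables

    // Search the ring for context 'ctx'; on success return the rank that the
    // encoder emits (0 = most recent) instead of a distance.
    static int findRank(const uint8_t* buf, uint32_t cur, uint32_t ctx,
                        uint32_t minLen, uint32_t* matchPos) {
        const CtxRing& r = rings[ctx];
        for (int i = 0; i < kSlots; ++i) {
            uint32_t p = r.pos[(r.head - 1 - i) & (kSlots - 1)];
            if (p == 0) break;                       // empty slot: stop
            uint32_t len = 0;
            while (len < minLen && buf[p + len] == buf[cur + len]) ++len;
            if (len == minLen) { *matchPos = p; return i; }
        }
        return -1;                                   // no match: emit a literal
    }

    static void insertPos(uint32_t ctx, uint32_t pos) {
        CtxRing& r = rings[ctx];
        r.pos[r.head & (kSlots - 1)] = pos;
        ++r.head;
    }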

  8. #38
    Member (USA)
    English Wikipedia contains a reference to "QUAD / BALZ / BCM (highly-efficient ROLZ-based compressors)" on the archiver comparison page, where it discusses PeaZip under the heading "Uncommon archive format support", but hovering the cursor over the ROLZ link shows the page is still missing: "ROLZ (page does not exist)".

  9. #39
    Member lz77
    Quote Originally Posted by Shelwien View Post
    The same has to be done during decoding, so decoding is slower,
    but compression is better.
    Decoding is slower, but in the scoring formula (F = c_time + 2*d_time + ...) the decoding time is multiplied by 2...

    As I understand ROLZ: if the current position holds 'the ' and the preceding symbol is 'n', then we look in a history buffer at up to 256 saved positions for 'the ', but only those whose preceding symbol was 'n'? And if the search fails, then 't' is a literal? Are there too many literals?

    Is LZP also like ROLZ?

  10. #40
    Administrator Shelwien
    > the decoding time is multiplied by 2...

    Yes, ROLZ can provide much better compression with a fast parsing strategy,
    which might be good for the competition.

    > Are there too many literals?

    Mostly the same as with normal LZ77.
    LZ77 usually already works like that - take a context hash and
    walk a hash-chain list to check previous matches in that context -
    the difference is that LZ77 then encodes the match distance,
    while ROLZ encodes the number of hash-chain steps.

    > Is LZP also like ROLZ?

    LZP is a special case of ROLZ with only one match per context hash.
    So it encodes only lengths and literals, with no distance equivalent.

    But LZP is rarely practical on its own - it's commonly used as a dedup preprocessor
    for some stronger algorithm.
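
    A minimal LZP sketch of that special case (illustrative; the table size and hashing are arbitrary here):
    Code:
    // LZP sketch: one predicted position per context hash, so the stream
    // carries no distances at all, only match lengths and literals.
    #include <cstdint>

    static uint32_t lzpTable[1 << 16];   // context hash -> last position (0 = empty)

    // Returns the predicted match length at 'cur' (0 means emit a literal).
    static uint32_t lzpMatch(const uint8_t* buf, uint32_t cur, uint32_t end,
                             uint32_t ctxHash) {
        uint32_t p = lzpTable[ctxHash];
        lzpTable[ctxHash] = cur;         // this position becomes the prediction
        if (p == 0) return 0;
        uint32_t len = 0;
        while (cur + len < end && buf[p + len] == buf[cur + len]) ++len;
        return len;
    }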

  11. #41
    Member lz77
    After preprocessing TS40.txt with my preprocessor, my compressor compresses it 16 MB better; is that great?

    Yesterday I had two thoughts about ROLZ:

    1. The compressor can calculate hashes using bytes to the right of the current position (for example, using cdef in abcd|efgh), but the decompressor can't: abcd|????. Are matches near the current position impossible in ROLZ?

    2. At the beginning of the data, classic LZ can compress abcd but ROLZ can't: in abcdEabcdFabcd...Eabcd..., the first appearance of abcd has no predecessor char, the second doesn't have the right one, and only the second Eabcd will match...

    Did I understand correctly that these are disadvantages of the ROLZ algorithm?
    Last edited by lz77; 23rd September 2020 at 12:48.

  12. #42
    Administrator Shelwien
    > After preprocessing TS40.txt with my preprocessor, my compressor compresses it 16 MB better; is that great?

    Seems reasonable:
    Code:
    100,000,000 enwik8
     61,045,002 enwik8.mcm      // mcm -store
     25,340,456 enwik8.zst      // zstd -22
     24,007,111 enwik8.mcm.zst  // zstd -22
    > Did I understand correctly that these are disadvantages of the ROLZ algorithm?

    No, the ROLZ matchfinder can be exactly the same as an LZ77 one -
    the only difference required (for it to be called ROLZ) is encoding the match rank (during the search)
    instead of the distance/position, which forces the decoder to also run the matchfinder.
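
    To illustrate the decoder side (a sketch mirroring the ring structure from the ROLZ example in post #37; the names are mine, not from any real codec):
    Code:
    // The decoder maintains the same per-context position rings as the encoder,
    // so a decoded rank maps straight back to an absolute source position.
    #include <cstdint>

    static const int kSlots = 32;
    struct CtxRing { uint32_t pos[kSlots]; uint8_t head; };

    static uint32_t resolveRank(const CtxRing& r, int rank) {
        return r.pos[(r.head - 1 - rank) & (kSlots - 1)];
    }
    // After copying the match, the decoder inserts the new position into the
    // ring exactly like the encoder, keeping both sides in lockstep.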

  13. Thanks:

    lz77 (26th September 2020)

  14. #43
    Member lz77
    I found the article http://www.ezcodesample.com/rolz/rolz_article.html and tried some of its iROLZ examples...
    Running http://www.ezcodesample.com/rolz/ske...ctionaries.txt on enwik8 gives this output:
    ===
    Original and compressed data sizes 100000000 50804922
    Approximate ratio relative to original size 0.345271
    ==
    Hm, 50 MB is 34% of 100 MB? Bad...

    http://www.ezcodesample.com/rolz/irolzstream.txt compresses ts40.txt to 42%; that's also bad...

    Maybe I looked at the wrong ROLZ sources?

    By the way: I've installed Code::Blocks with MinGW; why can't I run the debugger? F8 etc. doesn't work...

  15. #44
    Administrator Shelwien
    The main problem is that default irolz has a 256 KB window (d18).
    You can look for ROLZ here: http://mattmahoney.net/dc/text.html

    > I've installed Code::Blocks with MinGW; why can't I run the debugger?

    Maybe compile with debug options? http://wiki.codeblocks.org/index.php...h_Code::Blocks

  16. Thanks:

    lz77 (26th September 2020)

  17. #45
    Member lz77
    BALZ v1.20 (a ROLZ compressor) compresses & decompresses TS40.txt in 40 sec. instead of < 16 sec.... But the ratio is worse than lzturbo shows. And lzturbo shows overall time < 10 sec.... Can BALZ be accelerated?

  18. #46
    Administrator Shelwien
    > BALZ v1.20 (a ROLZ compressor)

    Current version is 1.50 afaik.

    > But the ratio is worse than lzturbo shows.

    Code:
    30,056,097 enwik8.balz150c
    30,808,657 enwik8.lzt32
    The difference is likely due to lzturbo using a 64 MB block by default,
    while BALZ uses a 32 MB block - try lzturbo -b32.

    > And lzturbo shows overall time < 10 sec.... Can BALZ be accelerated?

    Yes, it uses getc/putc I/O for the compressed data and fread/fwrite on
    whole 32 MB blocks of uncompressed data - these certainly affect
    the speed in fast mode.
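
    For illustration, a sketch of the usual fix for the getc/putc part (buffering on the user side so the hot loop doesn't call the CRT per byte; names and sizes are arbitrary):
    Code:
    // Replace per-byte putc with a user-side staging buffer.
    #include <cstdio>
    #include <cstdint>

    struct OutBuf {
        FILE*   f = nullptr;
        uint8_t buf[1 << 16];       // 64 KB staging buffer
        size_t  n = 0;
        void put(uint8_t b) {       // hot path: one branch + one store
            if (n == sizeof(buf)) { fwrite(buf, 1, n, f); n = 0; }
            buf[n++] = b;
        }
        void flush() { if (n) { fwrite(buf, 1, n, f); n = 0; } }
    };
    // Usage: OutBuf out; out.f = fopen("out.bin", "wb");
    //        ... out.put(byte); ... out.flush(); fclose(out.f);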

    Also lzturbo likely includes all the popular speed optimization tricks,
    like SIMD and stream interleaving.

  19. Thanks:

    lz77 (30th September 2020)


