Results 1 to 10 of 10

Thread: Compress enwik8 with ratio < 40%...

  1. #1
    Member lz77's Avatar
    Join Date
    Jan 2016
    Location
    Russia
    Posts
    59
    Thanks
    16
    Thanked 12 Times in 8 Posts

    Question Compress enwik8 with ratio < 40%...

    Can anyone compress enwik8 with ratio < 40% using LZ77 only with a simple hash table?

  2. #2
    Member jibz's Avatar
    Join Date
    Jan 2015
    Location
    Denmark
    Posts
    124
    Thanks
    106
    Thanked 71 Times in 51 Posts
    Given the previous times you've asked roughly the same, you need to specify exactly what you mean by "LZ77 only".

    There is a whole range of options from actual LZ77 with a fixed size token for length and offset, over variable length integer encodings like LZ4, to ranges inside bytes like lzo1x, to bit-level encoding like LZSS uses for literal/match, to universal codes like Elias gamma.

  3. #3
    Member
    Join Date
    Aug 2015
    Location
    indonesia
    Posts
    236
    Thanks
    29
    Thanked 23 Times in 21 Posts
    PAQ8SK

    this is forked from paq8pxv182fix1 with some tweaking on textmodel n increase the memory usage upto 7gb. the result for xml file is:
    paq8pxv182fix1‚Äč 250750 bytes
    paq8sk 249237 bytes
    enwik8 on progress
    Attached Files Attached Files

  4. #4
    Member lz77's Avatar
    Join Date
    Jan 2016
    Location
    Russia
    Posts
    59
    Thanks
    16
    Thanked 12 Times in 8 Posts
    Quote Originally Posted by jibz View Post
    There is a whole range of options from actual LZ77 with a fixed size token for length and offset, over variable length integer encodings like LZ4, to ranges inside bytes like lzo1x, to bit-level encoding like LZSS uses for literal/match, to universal codes like Elias gamma.
    I thought it was intuitively clear... It must be like LZ4/LZ5 without search matches. Only hash table with size ~ 512 Kb and no additional compression of mathes/literal length/literals/prefixes and offsets. Bit or byte encoding is up to you.
    Last edited by lz77; 8th April 2020 at 12:44.

  5. #5
    Member
    Join Date
    Apr 2009
    Location
    The Netherlands
    Posts
    78
    Thanks
    6
    Thanked 18 Times in 11 Posts
    LittleBit (https://github.com/kapenga/LittleBit) can compress Enwik8 with a static Huffman tree to less than 40%. With a 1.5mb tree it could compress enwik8 to 26,5mb and that's including the tree. With a tree limited to 512kb it would be still below 40mb because the gains after a 512kb tree are small.
    I think other methods should be possible too. It should be do-able.
    But why would you want this?

  6. #6
    Member lz77's Avatar
    Join Date
    Jan 2016
    Location
    Russia
    Posts
    59
    Thanks
    16
    Thanked 12 Times in 8 Posts
    Quote Originally Posted by Kaw View Post
    But why would you want this?
    I wrote for sports and academic interest pure and only LZ77 type compressor (while this is a prototype program for debugging) that for example beats blzpack -1 ... & blzpack -2 ... and approaching the zstd -1 ... So I want to compare my compressor with others that I may not know about.

    I'm going to port one of my algorithms on FASM for Win64 to achieve maximum results. I would like to find buyers for my algorithms/sources.

    Where I can download Windows binaries of LittleBit to compare with mine?

  7. #7
    Member RichSelian's Avatar
    Join Date
    Aug 2011
    Location
    Shenzhen, China
    Posts
    170
    Thanks
    20
    Thanked 61 Times in 30 Posts
    Quote Originally Posted by Kaw View Post
    LittleBit (https://github.com/kapenga/LittleBit) can compress Enwik8 with a static Huffman tree to less than 40%. With a 1.5mb tree it could compress enwik8 to 26,5mb and that's including the tree. With a tree limited to 512kb it would be still below 40mb because the gains after a 512kb tree are small.
    I think other methods should be possible too. It should be do-able.
    But why would you want this?
    it looks like kind of word-encoding + huffman algorithm, but the compression ratio is really surprising me...

  8. Thanks:

    Kaw (11th April 2020)

  9. #8
    Member
    Join Date
    Apr 2009
    Location
    The Netherlands
    Posts
    78
    Thanks
    6
    Thanked 18 Times in 11 Posts
    Quote Originally Posted by RichSelian View Post
    it looks like kind of word-encoding + huffman algorithm, but the compression ratio is really surprising me...
    Thanks!
    I think the technology behind it is more impressive than it get credits for.
    It's indeed static word-encoding and huffman, but there is science behind creating the optimal tree.

  10. #9
    Member
    Join Date
    Apr 2009
    Location
    The Netherlands
    Posts
    78
    Thanks
    6
    Thanked 18 Times in 11 Posts
    Quote Originally Posted by lz77 View Post
    Where I can download Windows binaries of LittleBit to compare with mine?
    Its java so you need to download the Java runtime stuff. On https://github.com/kapenga/LittleBit/releases you can find a release you can use with an example how to run it.

  11. #10
    Member lz77's Avatar
    Join Date
    Jan 2016
    Location
    Russia
    Posts
    59
    Thanks
    16
    Thanked 12 Times in 8 Posts
    Two my simple frugal examples (while these are prototypes for debugging and improvements) comparing with famous leaders in data compression.
    Benchmarked on Topton mini PC from Aiexpress: CPU I7 8850H, RAM 16 Gb @ 2.667 MHz,L1 cache 384 Kb, L2 1.5 Mb, L3 9 Mb.
    I've boot the PC on 6 cores/12 logical CPUs with a speed 4.18 Ghz, RAM disk 2 Gb, pagefile.sys is off.


    Some command lines:
    timer.exe lz4x64_1.9.2.exe -1 -f -l --no-frame-crc enwik9
    timer.exe lizard-1.0-win64.exe -30 -f -B7 --no-frame-crc enwik9
    timer.exe zstd_1.4.4_win64.exe -1 -f --no-progress enwik9


    Code:
    enwik9	       compress	user/process time |	uncompress user/process time	| ratio
    
    
    myex1			7.85/8.52	  |		   2.45/3.31		| 0.36454
    myex2			6.36/7.06	  |		   2.02/2.92		| 0.40813
    myex2 on asm				  |		   1.66/2.63
    blzpack -b1000m		7.31/8.14	  |		   3.81/4.70		| 0.38093
    brotli_1.07_win64 -0	3.47/3.83	  |		   3.02/3.42		| 0.37384
    lz4_1.9.2_win64		1.91/2.34	  |		   0.30/0.81		| 0.50933
    lz4_1.9.2_win64 -3     11.73/11.968	  |					| 0.3886
    lizard_1.0_win64 -22	7.61/8.14	  |		   0.56/1.14		| 0.4244
    lizard_1.0_win64 -30	3.17/3.73	  |		   1.06/1.44		| 0.42084
    zstd_1.4.4_win64 -1	3.14/3.42	  |		   1.05/1.52		| 0.35752

Similar Threads

  1. enwik8 rules
    By thometal in forum Data Compression
    Replies: 1
    Last Post: 7th February 2020, 01:05
  2. Replies: 0
    Last Post: 29th January 2020, 15:26
  3. Some perl scripts for enwik8 parsing
    By Shelwien in forum Data Compression
    Replies: 3
    Last Post: 3rd March 2019, 06:29
  4. Highest LZ77 compression of enwik8
    By lz77 in forum Data Compression
    Replies: 8
    Last Post: 12th August 2017, 14:58
  5. Replies: 3
    Last Post: 10th November 2007, 21:32

Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •