
Thread: flzp, new LZP compressor/preprocessor

  #1 Matt Mahoney (Expert, Melbourne, Florida, USA)

    flzp, new LZP compressor/preprocessor

    I wrote a new open source (GPL) bytewise LZP compressor called flzp,
    http://cs.fit.edu/~mmahoney/compression/text.html#4975

    It is fast but compression isn't very impressive. It is mainly interesting because it can be used as a preprocessor to another low-order compressor and improve the compression and speed over either one alone, for example:

    57,366,279 enwik8.flzp 8 sec (2.2 GHz Athlon-64, WinXP Home)
    63,391,013 enwik8.fpaq0 36 sec
    39,879,072 enwik8.flzp.fpaq0 8+21 sec

    It also improves compression for ppmd -o2 or -o3 (but not higher orders).

    It uses a simple algorithm. It divides the input into 64K blocks and any byte that doesn't appear in the block is used to code match lengths into a rotating 4MB buffer using a 1M hash table for an index. Literals are not compressed. The hash table is indexed by an order-4 context hash. Any unused byte is available to code matches, but if there are less than 32 then the block size is reduced. It doesn't need to use escape codes. In the worst case, a 223 byte block with each byte different would be expanded by 33 bytes (a 32 byte bitmap header showing available codes plus an end of block symbol).
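    The match-coding idea can be sketched roughly like this (a hypothetical illustration, not flzp's actual source: it assumes a single reserved code 0xFF instead of flzp's per-block scan for unused bytes, and it omits the block/bitmap framing):

    ```cpp
    #include <cstdint>
    #include <string>
    #include <vector>

    // Order-4 context hash into a table of previous positions.
    const int HBITS = 16;           // demo-sized table (flzp uses a 1M table)
    const uint8_t MATCH = 0xFF;     // assumed absent from the demo input
    const size_t MAXLEN = 254;      // so a match length fits in one byte

    static size_t hash4(uint32_t ctx) {
      return (ctx * 2654435761u) >> (32 - HBITS);
    }

    std::vector<uint8_t> lzp_compress(const std::string& in) {
      std::vector<size_t> table(1 << HBITS, 0);
      std::vector<uint8_t> out;
      uint32_t ctx = 0;             // last 4 bytes of input
      size_t i = 0;
      while (i < in.size()) {
        size_t h = hash4(ctx), p = table[h];
        table[h] = i;               // remember current position for this context
        size_t len = 0;
        while (p && i + len < in.size() && len < MAXLEN &&
               in[p + len] == in[i + len])
          len++;
        if (len >= 2) {             // matches of length >= 2 cost 2 bytes here
          out.push_back(MATCH);
          out.push_back((uint8_t)len);
          while (len--) ctx = ctx << 8 | (uint8_t)in[i++];
        } else {                    // literals are stored uncompressed
          out.push_back((uint8_t)in[i]);
          ctx = ctx << 8 | (uint8_t)in[i++];
        }
      }
      return out;
    }

    std::string lzp_decompress(const std::vector<uint8_t>& in) {
      std::vector<size_t> table(1 << HBITS, 0);
      std::string out;
      uint32_t ctx = 0;
      for (size_t i = 0; i < in.size();) {
        size_t h = hash4(ctx), p = table[h];
        table[h] = out.size();      // mirror the compressor's table updates
        if (in[i] == MATCH) {       // copy len bytes from the predicted position
          size_t len = in[i + 1];
          for (size_t k = 0; k < len; k++) {
            ctx = ctx << 8 | (uint8_t)out[p + k];
            out.push_back(out[p + k]);
          }
          i += 2;
        } else {
          ctx = ctx << 8 | in[i];
          out.push_back((char)in[i]);
          i++;
        }
      }
      return out;
    }
    ```

    Note that the decoder never needs the match verified: it replays the same table updates as the encoder, so the predicted position is always identical on both sides, which is why no offset is transmitted.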

    Speed is asymmetric. It decompresses twice as fast as it compresses because the compressor has to make 2 passes over each block. The first pass is to get a list of available bytes and determine the block size.

    I might add a low order context model to make a fast program with decent compression. Or maybe I will modify it to only code long matches as a preprocessor to a BWT compressor (to speed sorting) or CM (to reduce input size for speed). Currently it codes matches of length 2 or more as 1 byte.


  #2 Nania Francesco (Tester, Italy)

    Hi Matt!

    Interesting preprocessor/compressor! Very good job!

  #3 encode (The Founder, Moscow, Russia)
    Thanks Matt!

    Just as a note, you could use XOR instead of ADD in the hash function:
    Code:
    // Update hash of last L bytes through c
    inline void Buf::update_hash(int c) {
      h=(h*(3<<5)^c)&(HN-1);  // parentheses matter here: & binds tighter than ^
    }
    According to my experiments, XOR is often better...

  #4 toffer (Programmer, Erfurt, Germany)
    Good job! I might include something like this with further demo coders:

    Code:
    cm@e051 ~/shared/projects/demo coders/m012_0.2_080602/bin/Release $ ./m012 c /tmp/enwik8.flzp /tmp/enwik8.flzp.m012
    M012 v0.2 by C. Mattern  Jun  2 2008
    Experimental file compressor.
    Allocated 70165 kB
    Encoding...done. 30619962/57366279 (4.270 Bpc) in 130.750 s (only slower, since i increased cpu freq. while coding)
    cm@e051 ~/shared/projects/demo coders/m012_0.2_080602/bin/Release $ ./m012 c ~/shared/temp/testset/enwik8 /tmp/enwik8.m012
    M012 v0.2 by C. Mattern  Jun  2 2008
    Experimental file compressor.
    Allocated 70165 kB
    Encoding...done. 35086984/100000000 (2.807 Bpc) in 97.600 s

  #5 Moderator (Tristan da Cunha)

    Thanks Matt!

  #6 Bulat Ziganshin (Programmer, Uzbekistan)
    Matt, are you aware of http://www.compression.ru/ds/lzp.rar ?

    AFAIU, your variant should be much better due to direct encoding of match lengths. I hope you simply skip compression of small blocks that use more than 256-33 symbols.

    One particular problem with flzp is that some variants of dictionary preprocessing (which also use free symbols) aren't "compatible" with flzp.

  #7 SvenBent (Member, Denmark)
    Yay, another filter/preprocessor for me to play with.

    Going to test this; wait for the weekend for feedback.

  #8 Shelwien (Administrator, Kharkov, Ukraine)
    I tested FLZP and Shkarin's LZP preprocessor:

    Code:
    world95.txt  2988578

                filtered    +o1rc8
    LZP -l3      1482089    745027
    LZP -l5      1538783    764276
    LZP -l28     2232612   1038532
    LZP -l30     2274628   1055459
    LZP -l32     2314187   1071467
    flzp         1251473    748757
    o1rc8 is an order-1 coder with mixing from ST2rc v4.

  #9 SvenBent (Member, Denmark)
    Naah, nothing to be excited about.

    Some small tests (Diablo 2 + LoD CD images):

    INSTALL.iso - 535.545.856
    INSTALL.iso.flzp - 538.846.318

    CINEMATICS.iso - 443.781.120
    CINEMATICS.iso.flzp - 462.999.598

    PLAYDISC.bin.ecm - 611.417.493
    PLAYDISC.bin.ecm.flzp - 635.288.152

    EXPANSION.bin.ecm - 538.982.326
    EXPANSION.bin.ecm.flzp - 546.512.107


    Compressed size, 7-Zip ultra LZMA 192 MB = 1.972.714.300
    same, with FLZP as preprocessor = 2.063.677.861


    Conclusion (on these files):
    the individual files GREW in size after "compression".
    Also, further compressibility was not improved.


    Any suggestions for when FLZP is a better choice as a preprocessor?

  #10 Bulat Ziganshin (Programmer, Uzbekistan)
    Sven, it needs 32 unused symbols, so it works only on text files.

  #11 Shelwien (Administrator, Kharkov, Ukraine)
    So, we need another preprocessor: one which would convert bits to bytes, or nibbles to bytes, so that there are always some unused symbols in the alphabet.
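    Such a nibble-to-byte expansion could look roughly like this (a hypothetical sketch; the function names are made up): each byte is split into two bytes in 0..15, leaving codes 16..255 free for an LZP stage, at the cost of doubling the input first, which the LZP stage then has to win back.

    ```cpp
    #include <cstdint>
    #include <vector>

    // Expand each byte into two nibble-valued bytes (0..15),
    // guaranteeing 240 unused symbols for a later LZP pass.
    std::vector<uint8_t> nibbles_expand(const std::vector<uint8_t>& in) {
      std::vector<uint8_t> out;
      out.reserve(in.size() * 2);
      for (uint8_t b : in) {
        out.push_back(b >> 4);   // high nibble first
        out.push_back(b & 15);   // then low nibble
      }
      return out;
    }

    // Inverse transform: repack nibble pairs into bytes.
    std::vector<uint8_t> nibbles_pack(const std::vector<uint8_t>& in) {
      std::vector<uint8_t> out;
      out.reserve(in.size() / 2);
      for (size_t i = 0; i + 1 < in.size(); i += 2)
        out.push_back((uint8_t)(in[i] << 4 | in[i + 1]));
      return out;
    }
    ```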

  #12 Matt Mahoney (Expert, Melbourne, Florida, USA)
    Quote Originally Posted by SvenBent View Post
    Conclusion (on these files):
    the individual files GREW in size after "compression".
    Also, further compressibility was not improved.

    Any suggestions for when FLZP is a better choice as a preprocessor?
    It is best for highly redundant files.

    calgary\pic (513216) -> x (78242) in 0.06 s
    maxcomp\fp.log (20617071) -> x (1311552) in 1.33 s

    Then it works as a preprocessor for ppmd -o2

    513,216 PIC
    51,130 PIC.pmd-o2
    78,242 PIC.flzp
    51,033 PIC.flzp.pmd-o2

    20,617,071 fp.log
    2,258,063 fp.log.pmd-o2
    1,311,552 fp.log.flzp
    750,480 fp.log.flzp.pmd-o2

    On random or already compressed data it divides the input into small blocks in order to have 32 reserved symbols. Each block then adds a 32 byte header and one byte for end of block. In the worst case it expands 223 bytes to 256 (114.8%). It is also useless as a preprocessor. For example on random data:

    100,000 r256
    101,962 r256.pmd-o2
    106,366 r256.flzp
    107,035 r256.flzp.pmd-o2

    Yeah, there are ways to fix this, but I wanted to keep the format simple (no multi-byte codes) so that you could optimize the compressor without changing the compressed format. For example, if the block size is small you could reserve fewer symbols and still be compatible with the first version. The trick is to find the best rule.
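    The per-block scan for unused bytes and the 32-byte bitmap header could be sketched like this (a hypothetical illustration, not flzp's actual source; the function name is made up):

    ```cpp
    #include <array>
    #include <cstdint>
    #include <vector>

    // Scan one block for byte values that never occur and build the
    // 32-byte (256-bit) bitmap header; bit v set means value v is free
    // and can be used as a match code for this block.
    std::array<uint8_t, 32> unused_bitmap(const std::vector<uint8_t>& block,
                                          int* nfree) {
      bool seen[256] = {false};
      for (uint8_t b : block) seen[b] = true;
      std::array<uint8_t, 32> map{};
      *nfree = 0;
      for (int v = 0; v < 256; v++)
        if (!seen[v]) { map[v >> 3] |= (uint8_t)(1 << (v & 7)); ++*nfree; }
      return map;
    }
    ```

    The "reduce the block size if there are fewer than 32 free bytes" rule then becomes a simple check on `nfree` before committing to a block boundary.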
    Last edited by Matt Mahoney; 21st June 2008 at 03:24.

  #13 SvenBent (Member, Denmark)
    Quote Originally Posted by Matt Mahoney View Post
    It is best for highly redundant files.

    calgary\pic (513216) -> x (78242) in 0.06 s
    maxcomp\fp.log (20617071) -> x (1311552) in 1.33 s

    Then it works as a preprocessor for ppmd -o2
    Thanks for the info.

  #14 Member (USA)
    Quote Originally Posted by Matt Mahoney View Post
    It divides the input into 64K blocks and any byte that doesn't appear in the block is used to code match lengths
    This shows just how much of an amateur I still am at this... I can't believe I hadn't thought of it before: zero-bit handling! (Which is great, since the platform I'm targeting is not very good at bit manipulation.) Now I have to think about this for my stuff already in progress.

