Thread: Kwc – very simple keyword compressor

  1. #1
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    787
    Thanks
    64
    Thanked 274 Times in 192 Posts

    Kwc – very simple keyword compressor

Exactly two years ago I wrote the Kwc compressor for fun, as a spin-off from a program I wrote to count fixed-length keywords in text files.

Kwc takes every 6 bytes as a "keyword" string and does not store duplicate keywords in the output, only the unique ID of the already-found keyword.

The only novel thing I put in is what I called "gear encoding", which only gives a gain with unsorted but slowly increasing value ranges.
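    The keyword idea above can be sketched as follows; this is only an illustration of the 6-byte dictionary substitution, not Kwc's actual file format, and it leaves out the "gear encoding" of the ID stream:

    ```python
    def kwc_style_tokens(data: bytes):
        # Illustrative sketch (not Kwc's real format): the first time a
        # 6-byte chunk is seen it is emitted literally and given the next
        # free ID; later occurrences are replaced by that ID.
        ids = {}      # keyword bytes -> unique ID
        tokens = []   # ("lit", keyword) or ("id", number)
        whole = len(data) - len(data) % 6
        for i in range(0, whole, 6):
            kw = data[i:i+6]
            if kw in ids:
                tokens.append(("id", ids[kw]))
            else:
                ids[kw] = len(ids)
                tokens.append(("lit", kw))
        return tokens, data[whole:]   # token stream plus the <6-byte tail

    tokens, tail = kwc_style_tokens(b"abcdefabcdefxxxxxxabcdef")
    # "abcdef" is stored literally once and referenced twice by its ID
    ```

    A real implementation would also need to serialize the token stream compactly, which is where an ID encoding like the "gear" scheme would come in.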

It's written in Visual Basic 2008 and needs the .NET Framework 3.5 to run.

It runs under Windows as a GUI and in command-line mode, both 32-bit and 64-bit.

    Command line syntax:
    Compress: kwc c input output
    Decompress: kwc d input output

Memory usage depends on the input file size and the number of unique keywords found; I think around 0.5-1.5x the input file size for compression and half of that for decompression.

It can often be run recursively 3-6 times until the gain stops.
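    The recursive application is easy to script; a minimal sketch, assuming a kwc binary on the PATH (the command syntax is taken from the post above, the loop and file naming are illustrative):

    ```python
    import os
    import subprocess

    def compress_until_no_gain(path: str, max_rounds: int = 6) -> str:
        """Re-run `kwc c` on its own output until the file stops shrinking.

        Assumes a `kwc` executable on the PATH; returns the smallest file.
        """
        cur = path
        for n in range(1, max_rounds + 1):
            out = f"{path}.kwc{n}"
            subprocess.run(["kwc", "c", cur, out], check=True)
            if os.path.getsize(out) >= os.path.getsize(cur):
                os.remove(out)  # this round gave no gain; keep the previous file
                return cur
            cur = out
        return cur
    ```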

I didn't build in full error handling, so when there are too many unique keywords in an input file (around 36 million or more) it crashes with an overflow. This often happens when the input file is >1GB and/or random or already compressed.
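    The crash on random or already-compressed input follows from simple arithmetic: 6-byte chunks of high-entropy data are drawn from a space of 2^48 values, so at these file sizes almost every chunk is unique and the dictionary grows towards input_size/6 entries:

    ```python
    # Rough upper bound on unique 6-byte keywords; for high-entropy input it
    # is nearly tight, because collisions among 2**48 possible values are
    # negligible at gigabyte scale.
    def max_unique_keywords(input_size_bytes: int) -> int:
        return input_size_bytes // 6

    # A 1 GB random file approaches ~167M unique keywords, far past the
    # ~36M point where the overflow was reported.
    print(max_unique_keywords(1_000_000_000))  # -> 166666666
    ```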

    Download:
    http://www.metacompressor.com/download/kwc.zip

Maybe Matt can test it at the LTB and GCB so I can compare it against BPE2, which is also a dictionary-type compressor.
    Last edited by Sportman; 21st January 2010 at 06:49.

  2. #2
    Tester
    Black_Fox's Avatar
    Join Date
    May 2008
    Location
    [CZE] Czechia
    Posts
    471
    Thanks
    26
    Thanked 9 Times in 8 Posts
enwik8: 100 000 000 -> 54 097 740 (45,481 s), 2 510 921 unique keywords
enwik8.kwc: 54 097 740 -> 53 960 841 (26,537 s), 8 616 190 unique keywords

enwik9: 1 000 000 000 -> 532 622 518 (613,085 s), 10 587 596 unique keywords
enwik9.kwc: crashed about halfway through because of the abovementioned problem (37M codewords)

    Did not try decompression. C2Q Q6600 @ 3.0GHz, Win 7 Pro x64
    Last edited by Black_Fox; 19th January 2010 at 15:27.
    I am... Black_Fox... my discontinued benchmark
"No one involved in computers would ever say that a certain amount of memory is enough for all time. I keep bumping into that silly quotation attributed to me that says 640K of memory is enough. There's never a citation; the quotation just floats like a rumor, repeated again and again." -- Bill Gates

  3. #3
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    787
    Thanks
    64
    Thanked 274 Times in 192 Posts
Black_Fox, thanks for testing.

I removed that 37 million unique keyword overflow limit and replaced the old version with the new one. It now gives a memory overflow at 47.9 million unique keywords under 64-bit Windows, even when ten times more memory is available; this looks .NET related.

  4. #4
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts

  5. #5
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,501
    Thanks
    741
    Thanked 664 Times in 358 Posts
    Quote Originally Posted by Black_Fox View Post
enwik8: 100 000 000 -> 54 097 740 (45,481 s), 2 510 921 unique keywords
    dict: 5 seconds, 52,795,972 bytes

  6. #6
    Member Skymmer's Avatar
    Join Date
    Mar 2009
    Location
    Russia
    Posts
    681
    Thanks
    38
    Thanked 168 Times in 84 Posts
    Here are my results on enwik8. RAM Disk used. Tested on AMD64 4000+, XP SP3.
    Code:
    tool       comp.     decomp.        size
    -----     ------     -------     ----------
    kwc       20.627      9.228      54 097 740
    dict       7.173      1.013      52 812 125
But I'm curious. Bulat's result is 52 795 972 bytes. I have three different versions of dict: one taken from the Haskell site, another from here, and the last from freearc.org. None of them gives me 52 795 972 bytes, so the question is: do we have different enwiks, or is there some secret dict version?

  7. #7
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    787
    Thanks
    64
    Thanked 274 Times in 192 Posts
    Matt, thanks for testing!

I have also tested two versions of Dict (8/6/2007 and 3/13/2009); both give the same output size, but different from the two output sizes mentioned. The 3/13/2009 Dict version was a little slower than the older one.

    dict -1 74,710,608 bytes
    dict -f 54,763,991 bytes
    kwc c 54,097,740 bytes
    dict -p 53,897,083 bytes

    See link for timings and memory use:
    http://www.metacompressor.com/upload...estfile=enwik8

    Both Dict versions crash while compressing enwik9 with all three settings.

    Is there a compiled bpe2 version for Windows 32/64bit?
    http://mattmahoney.net/dc/bpe2v2.cpp

  8. #8
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,501
    Thanks
    741
    Thanked 664 Times in 358 Posts
1. I actually used arc -mdict
2. The last update to dict was Aug 2 - it fixed a crashing bug
3. There are internal settings that cannot be controlled in dict.exe, but can be with -mdict:

Code:
MinWeakChars = 20;
MinLargeCnt  = 2048;
MinMediumCnt = 100;
MinSmallCnt  = 50;
MinRatio     = 4;

case 'p': p->MinLargeCnt=8192; p->MinMediumCnt=400; p->MinSmallCnt=100; p->MinRatio=4; continue;
case 'f': p->MinLargeCnt=2048; p->MinMediumCnt=100; p->MinSmallCnt= 50; p->MinRatio=0; continue;
case 'c': p->MinWeakChars = parseInt (param+1, &error); continue;
case 'l': p->MinLargeCnt  = parseInt (param+1, &error); continue;
case 'm': p->MinMediumCnt = parseInt (param+1, &error); continue;
case 's': p->MinSmallCnt  = parseInt (param+1, &error); continue;
case 'r': p->MinRatio     = parseInt (param+1, &error); continue;


The default settings are optimized to improve dict+lzma compression; by changing them, compression can be improved somewhat. The best I can do:

    D:\testing>arc create a -mdict:l1000:m39:s25:r1:33m enwik8
    Compressed 1 file, 100,000,000 => 50,263,682 bytes. Ratio 50.2%
    Compression time: cpu 4.13 secs, real 4.53 secs. Speed 22,080 kB/s

Making :r floating point would probably allow breaking the 50 MB barrier.

  9. #9
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,501
    Thanks
    741
    Thanked 664 Times in 358 Posts
    Quote Originally Posted by Sportman View Post
    dict -1 74,710,608 bytes
    dict -f 54,763,991 bytes
    dict -p 53,897,083 bytes
You forgot to test it without any switches.
    Last edited by Bulat Ziganshin; 20th January 2010 at 15:25.

  10. #10
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,501
    Thanks
    741
    Thanked 664 Times in 358 Posts
    D:\testing>arc create a -mdict enwik9
    Compressed 1 file, 1,000,000,000 => 506,229,029 bytes. Ratio 50.6%
    Compression time: cpu 44.57 secs, real 48.01 secs. Speed 20,829 kB/s


    D:\testing>arc create a -mdict:l1000:m39:s26:r1:33m enwik9
    Compressed 1 file, 1,000,000,000 => 484,991,633 bytes. Ratio 48.4%
    Compression time: cpu 48.22 secs, real 53.77 secs. Speed 18,599 kB/s


I didn't try to optimize it for e9, just used the switches that were semi-optimal for e8.

  11. #11
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    787
    Thanks
    64
    Thanked 274 Times in 192 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
You forgot to test it without any switches.
    Oops, quickly added:

    dict -1 74,710,608 bytes
    dict -f 54,763,991 bytes
    kwc c 54,097,740 bytes
    dict -p 53,897,083 bytes
    dict 52,812,712 bytes
    arc a -mdict 52,796,199 bytes
    arc a -mdict:l1000:m39:s26:r1:33m 50,292,278 bytes

    See link for timings and memory use:
    http://www.metacompressor.com/upload...estfile=enwik8

Also, without any switch, both crash on enwik9.

I also added arc a -mdict and arc a -mdict:l1000:m39:s26:r1:33m results for enwik8 and enwik9:
    http://www.metacompressor.com/upload...testfile=file4


When I test Dict and Arc -mdict on rafale.bmp, I get much worse compression results compared with Kwc:

    4,066,106 bytes dict -p
    3,487,657 bytes dict -1
    3,468,906 bytes arc a -mdict
    3,468,672 bytes dict
    3,259,972 bytes arc a -mdict:l1000:m39:s26:r1:33m
    1,849,159 bytes kwc c

    See link for timings and memory use:
    http://www.metacompressor.com/upload...estfile=rafale

    What kind of dictionary does Dict build?
    Last edited by Sportman; 20th January 2010 at 17:49.
