Thread: Kwe - keyword encoder

  1. #1
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    1,026
    Thanks
    103
    Thanked 410 Times in 285 Posts

    Kwe - keyword encoder

    Kwe version 0.0.0.3 - keyword encoder

    Kwe encodes keywords:
    kwe e input output
    kwe d input output

    There are 4 optional parameters (all four must be given together):
    - keyword separator (default 32 = space)
    - minimum keyword length (default 2 = min.)
    - maximum keyword length (default 255 = max.)
    - keyword depth (default 255 = max.)

    Command-line version; runs on .NET Framework 4.8, Windows, Linux and Mac OS X.

    This is a very simple first version: there is no error handling and it is not well tested.

    Input can be any text file.
    Output can be compressed with another archiver.
    Last edited by Sportman; 26th July 2020 at 02:23.

  2. #2
    Member
    Join Date
    Dec 2012
    Location
    japan
    Posts
    164
    Thanks
    31
    Thanked 64 Times in 40 Posts
    Must the separator exist in the file?

  3. #3
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    1,026
    Thanks
    103
    Thanked 410 Times in 285 Posts
    Quote Originally Posted by xezz View Post
    Must the separator exist in the file?
    Yes, Kwe uses the separator together with unused byte values to navigate.
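
    The reply above mentions two ingredients: a separator byte and byte values that never occur in the input. Kwe's source is not shown in this thread, so the following is only a minimal Python sketch of those two ideas. The function names `unused_bytes` and `split_keywords` are made up for illustration and are not Kwe's actual code.

```python
def unused_bytes(data: bytes) -> list[int]:
    """Return the byte values (0-255) that never occur in data."""
    present = set(data)
    return [b for b in range(256) if b not in present]


def split_keywords(data: bytes, separator: int = 32) -> list[bytes]:
    """Split data into candidate keywords on the separator byte (32 = space)."""
    return data.split(bytes([separator]))


sample = b"the cat and the dog and the bird"
free = unused_bytes(sample)     # byte values available as codes
words = split_keywords(sample)  # candidate keywords
```

    If the separator never occurs in the input, the split yields a single token and there is nothing to work with, which fits the answer that the separator must exist in the file.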

  4. #4
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    1,026
    Thanks
    103
    Thanked 410 Times in 285 Posts
    I wrote my first ever C++ program from scratch. Its logic is a little different from the VB.NET version, it has no optional parameters, and the C++ decoder is probably compatible.

    Input:
    400,000,000 bytes - TS40.txt

    Output:
    357,560,964 bytes, 78.312 sec. - 15.904 sec., kwe 0.0.0.1 (32-bit) VB.NET 4.8
    357,579,332 bytes, 49.009 sec. - 06.691 sec., kwe 0.0.0.1 C++ GCC 8.1.0
    357,579,332 bytes, 39.996 sec. - 12.931 sec., kwe 0.0.0.1 C++ VS 2019

    357,579,332 bytes, 55.781 sec. - 3.178 sec., kwe 0.0.0.2 (32-bit) VB.NET 4.8
    357,579,332 bytes, 49.109 sec. - 2.390 sec., kwe 0.0.0.2 VB.NET 4.8
    357,579,332 bytes, 46.695 sec. - 1.749 sec., kwe 0.0.0.2 C++ GCC 8.1.0
    357,579,332 bytes, 32.947 sec. - 1.795 sec., kwe 0.0.0.2 C++ VS 2019

    357,537,932 bytes, 43.998 sec. - 2.443 sec., kwe 0.0.0.3 VB.NET 4.8
    357,537,932 bytes, 37.775 sec. - 1.577 sec., kwe 0.0.0.3 C++ GCC 8.1.0
    357,537,932 bytes, 34.619 sec. - 1.635 sec., kwe 0.0.0.3 C++ VS 2019
    Last edited by Sportman; 26th July 2020 at 02:28.
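
    To put the output sizes above in perspective, here is a small Python calculation of the fraction of the original file that remains after encoding. The sizes are copied from the post; the dictionary labels are just shorthand chosen for this sketch.

```python
ORIGINAL = 400_000_000  # TS40.txt size in bytes, from the post above

outputs = {
    "kwe 0.0.0.1 VB.NET (32-bit)": 357_560_964,
    "kwe 0.0.0.2": 357_579_332,
    "kwe 0.0.0.3": 357_537_932,
}

for name, size in outputs.items():
    saved = ORIGINAL - size
    print(f"{name}: {size / ORIGINAL:.2%} of original, {saved:,} bytes saved")
```

    For version 0.0.0.3 this comes to 42,462,068 bytes saved, so roughly 89.4% of the input remains before any further compression.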

  5. #5
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    1,026
    Thanks
    103
    Thanked 410 Times in 285 Posts
    Created a GCC 9.3.0 Linux version.
    Last edited by Sportman; 24th July 2020 at 04:21.

  6. #6
    Member lz77's Avatar
    Join Date
    Jan 2016
    Location
    Russia
    Posts
    127
    Thanks
    38
    Thanked 13 Times in 9 Posts
    Hm... Today Google Chrome advises against downloading files from this site...

  7. #7
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,977
    Thanks
    296
    Thanked 1,304 Times in 740 Posts
    Google doesn't like executables.

  8. #8
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    1,026
    Thanks
    103
    Thanked 410 Times in 285 Posts
    Created an Apple clang 11.0.3 Mac OS X version.
    Last edited by Sportman; 24th July 2020 at 04:21.

  9. #9
    Member lz77's Avatar
    Join Date
    Jan 2016
    Location
    Russia
    Posts
    127
    Thanks
    38
    Thanked 13 Times in 9 Posts
    > 357,579,332 bytes...

    Weak... Simply replacing the 66 most used words like "the", "they", "and" with 66 unused byte values would shrink TS40.txt by ~40 MB. But XWRT reduces its size down to 135 MB! How does XWRT do it?

    If one programmer uses his own original algorithms, and another programmer grabs the sources of XWRT, zstd, ... and wins the GDC prize,
    then it seems that something is wrong with GDC...
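
    The simple substitution described in this post (most frequent words replaced by byte values that do not occur in the file) can be sketched in Python as follows. This is an illustration of that idea only, not the algorithm of Kwe or XWRT; the names `build_table`, `encode` and `decode` are hypothetical, and a real tool would also have to store the table alongside the output.

```python
from collections import Counter


def build_table(data: bytes, max_codes: int = 66, sep: bytes = b" ") -> dict[bytes, int]:
    """Map the most frequent words to byte values that never occur in data."""
    # The separator is never used as a code, even if it is absent from data.
    present = set(data) | set(sep)
    free = [b for b in range(256) if b not in present]
    # Only words longer than one byte can shrink when replaced by one byte.
    counts = Counter(w for w in data.split(sep) if len(w) > 1)
    top = [w for w, _ in counts.most_common(min(max_codes, len(free)))]
    return dict(zip(top, free))


def encode(data: bytes, table: dict[bytes, int], sep: bytes = b" ") -> bytes:
    """Replace every word found in the table by its one-byte code."""
    return sep.join(bytes([table[w]]) if w in table else w for w in data.split(sep))


def decode(data: bytes, table: dict[bytes, int], sep: bytes = b" ") -> bytes:
    """Expand one-byte codes back into the original words."""
    reverse = {bytes([code]): word for word, code in table.items()}
    return sep.join(reverse.get(w, w) for w in data.split(sep))


text = b"the cat and the dog and they ran and the bird sang"
table = build_table(text)
packed = encode(text, table)
restored = decode(packed, table)
```

    Because the codes are byte values absent from the original, a one-byte token in the encoded stream can only be a code, so decoding is unambiguous.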

  10. #10
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,977
    Thanks
    296
    Thanked 1,304 Times in 740 Posts
    > If one programmer uses his own original algorithms, and another programmer grabs the sources of XWRT, zstd, ... and wins the GDC prize, then it seems that something is wrong with GDC...

    1. Testing on filtered English plaintext is the only compromise I could think of,
    since there are several open-source implementations of preprocessors for it, and it's still quite relevant for actual storage tasks.
    We did consider providing already preprocessed text (there's a risk of winning by reverse-engineering our preprocessing algorithm and using a better one),
    or choosing non-English text (e.g. Chinese; there's a risk of winning simply by implementing a custom preprocessor).
    I think English text is a good tradeoff - it's not that hard to integrate an open-source preprocessor, and it's still possible
    to write a new one from scratch without reverse-engineering skills (which are not really relevant for a compression contest).

    2. GDC has 24 prizes (1st/2nd places in 12 categories), and text is one of 3-4 data types. Just choose a different category if you don't like text preprocessing.

  11. Thanks:

    lz77 (19th July 2020)

  12. #11
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    1,026
    Thanks
    103
    Thanked 410 Times in 285 Posts
    Released Kwe 0.0.0.2 VB.NET with same logic as the C++ version and added the faster timings.

  13. #12
    Member lz77's Avatar
    Join Date
    Jan 2016
    Location
    Russia
    Posts
    127
    Thanks
    38
    Thanked 13 Times in 9 Posts
    Quote Originally Posted by Shelwien View Post
    2. GDC has 24 prizes (1st/2nd places in 12 categories), and text is one of 3-4 data types. Just choose a different category if you don't like text preprocessing.
    I think there should be a competition with a limit of ~30 sec. for compressing + decompressing 1 GB of data of unknown format, so that participants can show the power of their algorithms instead of building a compressor for one particular file.

    By the way: today Chrome does not complain when downloading files from this site. Yesterday it thought this site might have been hacked.

  14. #13
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,977
    Thanks
    296
    Thanked 1,304 Times in 740 Posts
    > I think there should be a competition with a limit of ~30 sec. for compressing + decompressing 1 GB of data of unknown format,
    > so that participants can show the power of their algorithms instead of building a compressor for one particular file.

    Ideally yes, but in practice it means having to integrate custom handlers for all popular formats.
    I think it's too much work for a contest.

    > By the way: today Chrome does not complain when downloading files from this site.
    > Yesterday it thought this site might have been hacked.

    Maybe somebody used Chrome to download a file that Google didn't like (one packed with an exe packer or something).
    They have even more false positives than the worst AVs.

  15. #14
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    1,026
    Thanks
    103
    Thanked 410 Times in 285 Posts
    Quote Originally Posted by lz77 View Post
    >357,579,332 bytes...

    Weak...
    I ran a test with max depth 65,535 (not supported at the moment) and got 254,532,577 bytes as output, but it took almost 8 hours: brute force is too slow at this depth and smarter logic is needed.

  16. #15
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    1,026
    Thanks
    103
    Thanked 410 Times in 285 Posts
    Updated Windows, Linux, Mac OS X C++ version to 0.0.0.2, optional options supported now, updated timings.

  17. #16
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    1,026
    Thanks
    103
    Thanked 410 Times in 285 Posts
    Updated Windows, Linux, Mac OS X C++ version to 0.0.0.3, improved encoding, updated timings.

  18. #17
    Member lz77's Avatar
    Join Date
    Jan 2016
    Location
    Russia
    Posts
    127
    Thanks
    38
    Thanked 13 Times in 9 Posts
    Quote Originally Posted by Sportman View Post
    I did a test with max depth 65,535 (not supported now) and got 254,532,577 bytes as output...
    But is the 254,532,577-byte file still compressible by LZ77? What is the ratio with blzpack -2 -b254m?

  19. #18
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    1,026
    Thanks
    103
    Thanked 410 Times in 285 Posts
    Quote Originally Posted by lz77 View Post
    What is the ratio with blzpack -2 -b254m?
    I did not keep the source, binary or output of the 65535-depth encoder. I wrote a new one, but its output is different; both may contain mistakes:

    Code:
    400,000,000 bytes, TS40.txt                168,727,011 bytes, 3.304 sec., blzpack -2 -b 254m
    399,987,375 bytes, TS40.txt.rle            168,725,981 bytes, 3.306 sec., blzpack -2 -b 254m
    357,537,932 bytes, TS40.txt.kwe255         180,405,740 bytes, 3.418 sec., blzpack -2 -b 254m
    356,515,428 bytes, TS40.txt.kwe255.rle     181,210,716 bytes, 3.421 sec., blzpack -2 -b 254m
    336,321,049 bytes, TS40.txt.kwe65535b      191,919,626 bytes, 3.236 sec., blzpack -2 -b 254m
    317,829,482 bytes, TS40.txt.kwe65535b.rle  191,095,389 bytes, 3.174 sec., blzpack -2 -b 254m
    269,717,172 bytes, TS40.txt.kwe65535a      172,431,277 bytes, 2.638 sec., blzpack -2 -b 254m
    251,039,562 bytes, TS40.txt.kwe65535a.rle  171,626,235 bytes, 2.562 sec., blzpack -2 -b 254m
    Last edited by Sportman; 28th July 2020 at 09:27.
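
    The table above can be read more easily as a relative change against the plain-input baseline. The short Python sketch below uses only the sizes listed in the table; the variable names are chosen for this illustration.

```python
BASELINE = 168_727_011  # blzpack -2 -b 254m on plain TS40.txt, from the table above

# Size after blzpack for each preprocessed input, copied from the table.
after_blzpack = {
    "TS40.txt.rle": 168_725_981,
    "TS40.txt.kwe255": 180_405_740,
    "TS40.txt.kwe255.rle": 181_210_716,
    "TS40.txt.kwe65535b": 191_919_626,
    "TS40.txt.kwe65535b.rle": 191_095_389,
    "TS40.txt.kwe65535a": 172_431_277,
    "TS40.txt.kwe65535a.rle": 171_626_235,
}

for name, size in after_blzpack.items():
    change = (size - BASELINE) / BASELINE
    print(f"{name}: {change:+.2%} versus plain input")
```

    With these figures every kwe-preprocessed file ends up larger after blzpack than the plain input does, even though the kwe files are smaller before compression.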

  20. Thanks:

    lz77 (27th July 2020)

  21. #19
    Member lz77's Avatar
    Join Date
    Jan 2016
    Location
    Russia
    Posts
    127
    Thanks
    38
    Thanked 13 Times in 9 Posts
    But lzturbo -p0 -32 -b400 compresses better without any preprocessor... We must work hard to win the 3000 euros in Rapid Compression.
    I think that after simple, quick preprocessing lzturbo can compress TS40.txt down to 110-115 megabytes.
    Last edited by lz77; 27th July 2020 at 15:50.
