
Thread: Text-ish preprocessors

  1. #1
    Member (Argentina, joined Aug 2014)

    Question: Text-ish preprocessors

    I'm looking for any text-oriented algorithm/program you may know of, specifically preprocessors designed to help general codecs like lzma, as I would like to test and compare them.

    When I say text, I'm not necessarily referring to the textbook definition of a file produced by a human for another human, like book1 (because it's not the 70's anymore and nobody saves a book in a .txt), but to the broader definition of a computer- or human-generated file consisting of only letters of a given alphabet, as opposed to a binary file, which can contain any byte from 00 to FF. For example: source code, logs, XML, HTML, PDF content before compression (there's a lot of formatting there), etc. Such files can contain 'natural' human words and sentences as well as content in a formal computer language.


    What I know as of today (and this is the part where somebody might want to correct me):

    AFAIK there are at least 3 types of program that can improve text compression:

    * 'Dictionary' algorithms use a list of common words, either pre-compiled or computed on the fly from the file, and replace those words with shorter codes. Example: XWRT
    * 'LZP' is a Lempel-Ziv variant that uses context to predict matches
    * and the Burrows-Wheeler Transform, which is actually a sorting mechanism that rearranges the data so it compresses better
    * Did I miss something?
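As a rough illustration of the first category, here is a toy word-replacement preprocessor (my own hypothetical sketch, not XWRT's actual format; it assumes the input contains no 0x01 bytes):

```python
import re
from collections import Counter

ESC = b"\x01"  # escape byte, assumed absent from the input

def build_dict(data: bytes, max_words: int = 64) -> list:
    # Most frequent words of 4+ letters; 64 entries keep the index
    # byte below 'A', so codes can never collide with letter sequences.
    words = re.findall(rb"[A-Za-z]{4,}", data)
    return [w for w, _ in Counter(words).most_common(max_words)]

def encode(data: bytes, dictionary: list) -> bytes:
    out = data
    for i, w in enumerate(dictionary):
        out = out.replace(w, ESC + bytes([i]))
    return out

def decode(data: bytes, dictionary: list) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i:i + 1] == ESC:
            out += dictionary[data[i + 1]]  # expand the 2-byte code
            i += 2
        else:
            out.append(data[i])
            i += 1
    return bytes(out)
```

The replaced stream is shorter and more regular, which is exactly what helps a later LZ/entropy stage.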


    Do you guys know any particularly good implementation of any of those algorithms (or others)? It's imperative that they can be used as a standalone process (without an entropy coder) so their output can be fed to a general-purpose compressor like lzma. And I would prefer them in source form, so I can compile them all with the same compiler, as I want to measure their performance.

  2. #2
    Shelwien (Administrator, Kharkov, Ukraine)
    http://nishi.dreamhosters.com/u/liptify.rar (DRT ancestor)
    http://nishi.dreamhosters.com/u/lpaq9m.rar (DRT)
    https://github.com/inikep/XWRT
    http://nishi.dreamhosters.com/u/filter.rar (Edgar Binder's preprocessor from DC)
    http://freearc.dreamhosters.com/dict.zip (Bulat's dict preprocessor)

    BWT isn't really compatible with LZ compression.
    LZP preprocessing only makes sense before BWT (mostly as speed/compression tradeoff).
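For readers unfamiliar with LZP, a minimal sketch of the idea: a hash of the recent context predicts a match position, and a correct prediction is replaced by a match token. (This is a toy token format of my own, not any particular tool's.)

```python
def lzp_encode(data: bytes, order: int = 3):
    # Table maps the last `order` bytes to the previous position where
    # that context occurred; a correct prediction becomes a match token.
    table = {}
    out = []  # tokens: ('lit', byte) or ('match', length)
    i = 0
    while i < len(data):
        ctx = data[max(0, i - order):i]
        pos = table.get(ctx)
        table[ctx] = i
        if pos is not None:
            n = 0
            while i + n < len(data) and data[pos + n] == data[i + n]:
                n += 1
            if n > 0:
                out.append(('match', n))
                i += n
                continue
        out.append(('lit', data[i]))
        i += 1
    return out

def lzp_decode(tokens, order: int = 3) -> bytes:
    # Mirrors the encoder: the table is rebuilt from the output stream.
    table = {}
    out = bytearray()
    for kind, val in tokens:
        ctx = bytes(out[max(0, len(out) - order):])
        pos = table.get(ctx)
        table[ctx] = len(out)
        if kind == 'match':
            for _ in range(val):
                out.append(out[pos])
                pos += 1
        else:
            out.append(val)
    return bytes(out)
```

Note there are no match offsets at all, only flags and lengths, which is why LZP output stays cheap enough to feed into another compressor.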

  3. Thanks:

    Gonzalo (23rd November 2019)

  4. #3
    Member (California, joined Nov 2014)
    https://github.com/mathieuchartier/mcm mcm -filter=dict -store
    https://github.com/flanglet/kanzi-cpp kanzi -t text -e none

    I am sure there are plenty more.

  5. Thanks:

    Gonzalo (23rd November 2019)

  6. #4
    Member (Argentina, joined Aug 2014)
    Thanks for the input! Some remarks:

    * liptify and DRT are both crashing. Can't use them
    * XWRT is one of the best but sometimes it behaves like there is nothing to do with the file and outputs a slightly bigger .xwrt. There are other odd behaviors, especially when extracting, so I'm considering it unstable (since it's no longer developed)
    * filter is of negligible help on my tests. I guess it's tuned for a particular file or something
    * FA's dict and lzp are among the truly helpful ones, although it depends on the specific file.
    * I can't compile mcm >>"2 warnings and 4 errors generated."
    * kanzi's filters don't seem to improve compression much; when they do, it's very little compared to FA's dict or lzp


    * pcompress shipped with a lot of filters and detection code but I also can't compile it. It depends on an outdated library.

    I'll keep looking for some others too

  7. #5
    Member (USA, joined Mar 2011)
    You can also try cmix -s dictionary/english.dic input output
    http://www.byronknoll.com/cmix.html

  8. #6
    Shelwien (Administrator, Kharkov, Ukraine)
    > liptify and DRT are both crashing. Can't use them

    Can you preprocess a small english text (like book1)? Does it work?
    liptify has a test script, and I explained how to use DRT (there's a readme too).
    I suspect that you just didn't get them to work, since there's no usage text
    and there are two steps.

    > * filter is of negligible help on my tests. I guess it's tuned for a particular file or something

    FILTER.EXE f -b1024 BOOK1 BOOK1.f
    converts BOOK1 from 768771 to 552973 bytes.

    > * I can't compile mcm >>"2 warnings and 4 errors generated."

    I've also got some errors. Since here we only need the preprocessor,
    I just removed the problematic lines (replaced with "return 0") and it worked.
    "mcm.exe -filter=dict -store BOOK1" generated file of size 509493.

  9. #7
    Member (California, joined Nov 2014)
    What kind of files are you compressing? Do you have examples?

    For pure english text and markup, see below:

    Code:
    liblzma 5.2.2
    mcm 0.83
    xwrt 3.4
    cmix 18
    
    -rw-rw-r-- 1 fred fred   252602 Nov 22 16:44 /disk1/ws/book1.dic.lzma
    -rw-rw-r-- 1 fred fred   251794 Nov 22 17:04 /disk1/ws/book1.f.lzma
    -rw-rw-r-- 1 fred fred   241663 Nov 22 17:16 /disk1/ws/book1.knz.lzma
    -rw-rw-r-- 1 fred fred   261032 Jun 22  2013 /disk1/ws/book1.lzma
    -rw-rw-r-- 1 fred fred   248257 Nov 22 17:10 /disk1/ws/book1.mcm.lzma
    -rw-rw-r-- 1 fred fred   242697 Nov 22 17:31 /disk1/ws/book1.xwrt.lzma
    -rw-rw-r-- 1 fred fred   234870 Nov 22 17:44 /disk1/ws/book1.cmx.lzma
    -rw-rw-r-- 1 fred fred   234294 Nov 23 08:51 /disk1/ws/book1.drt.lzma
    
    -rw-rw-r-- 1 fred fred 24773569 Nov 20 19:46 /disk1/ws/enwik8.dic.lzma
    -rw-rw-r-- 1 fred fred 25322624 Nov 22 17:05 /disk1/ws/enwik8.f.lzma
    -rw-rw-r-- 1 fred fred 24093239 Nov 22 17:09 /disk1/ws/enwik8.knz.lzma
    -rw-rw-r-- 1 fred fred 26371635 Jun 29 13:56 /disk1/ws/enwik8.lzma
    -rw-rw-r-- 1 fred fred 24362033 Nov 22 17:12 /disk1/ws/enwik8.mcm.lzma
    -rw-rw-r-- 1 fred fred 23180188 Nov 22 17:31 /disk1/ws/enwik8.xwrt.lzma
    -rw-rw-r-- 1 fred fred 23685080 Nov 22 17:44 /disk1/ws/enwik8.cmx.lzma
    -rw-rw-r-- 1 fred fred 23744945 Nov 23 08:52 /disk1/ws/enwik8.drt.lzma

    For kanzi, the default block size (1 MB) is too small for big files; give a bigger one:
    kanzi.exe -c -i r:\enwik8 -f -e none -b 100m -t text
    enwik8: 100000000 => 51793684 bytes in 826 ms
    book1: 768771 => 396684 bytes in 53 ms

    I guess the results of the different text processors will depend on the type of input (pure text, mixed text/bin, English or not, UTF-8/ASCII, ...). For instance, xwrt wins on enwik8 because it is an XML processor, while the others are just for text AFAIK.
    How about the compression/decompression speedup?
    Last edited by hexagone; 23rd November 2019 at 20:55. Reason: Fixed drt results

  10. #8
    Member (Argentina, joined Aug 2014)
    @shelwien: Yup. I just imagined that if they gave me a
    Code:
    Segmentation fault (core dumped)
    instead of at least exiting gracefully, they had to be faulty, so I didn't even read the docs past that point. My bad. They work as intended.

    @byronknoll: Thanks! Didn't think of cmix. BTW: is it supposed to run for a long time with "-s"? I was running it on a 50 MB file and I just cancelled it after several minutes. I was using a 'colaboratory' machine, so I didn't get the CPU usage. The size of input.cmix was 0 while running, 5 after cancelling.

    @hexagone: I use a few files. enwik8 is one of them. enwik9 comes in when I finish the script. I also have 'sda4', which is the first 50000000 bytes (50 MB) of
    Code:
    sudo cat sda4 | strings >sda4
    There is '7z', generated from the latest sources this way:
    Code:
    ##pseudo
    wget https://www.7-zip.org/a/7z1900-src.7z
    7z x 7z1900-src.7z
    7z a -ttar $extracted_files
    strings TAR_archive >7z
    And I will probably add a few more or stop using some of these, depending on whether they are representative of real-world data.

    I will enlarge kanzi's block size and come back with hard data. Thanks for the tip!

    >>How about the compression/decompression speedup ?
    Not yet. But I plan to measure it once I have all the preprocessors up and running.

    EDIT: What are *.f and *.dic?

  11. #9
    Member (California, joined Nov 2014)
    .f from filter.exe
    .dic from dict.exe

  12. #10
    Shelwien (Administrator, Kharkov, Ukraine)
    There's also nncp preprocess: https://bellard.org/nncp/

    Also this: https://sites.google.com/site/shelwien/dur05log_v0.rar (hacked durilca)
    dur_flt.exe e -t2 -o2 BOOK1 -> durilca.dmp contains preprocessed data

  13. #11
    Member (Poland, Warsaw, joined Dec 2008)
    > What kind of files are you compressing ? Do you have examples ?

    I have smaller textual files in my testset, and sometimes preprocessor scores can differ from those for bigger files.
    For example R.DOC

    Regarding preprocessors, there is also the DRT preprocessor; it was part of the lpaq9m package (in the attached ZIP).
    Attached Files

  14. #12
    Member (Argentina, joined Aug 2014)
    >I have smaller textual files in my testset, and sometimes preprocessor scores can differ from those for bigger files.
    >For example R.DOC

    Totally! That's why I'm writing a test script rather than just sharing my findings (I will when it's finished).
    I want everyone to be able to find out for themselves which preprocessor is best suited to their needs. It should be easy enough once it's finished. And I believe that if enough people use it, we could get an approximation of which ones are generally better than the rest. An ultimate version could even run as a nightly benchmark and use the collected data to find subgroups within the text category. For example, one method could be best for C++ code, another one for precomped PDFs. Anyway, we're too far away from that now. Just keep throwing programs at me, and I'll keep adding them to the test.


    @all: Should I write the test and then release it, or would you prefer to have an early version available for editing?
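A minimal sketch of the kind of harness described above: apply each candidate preprocessor, compress with one fixed back-end, and tabulate sizes. zlib stands in for lzma/ppmd here, and the preprocessors are plain callables; real tools would be wrapped with subprocess, so everything below is a placeholder.

```python
import zlib

def benchmark(data: bytes, preprocessors: dict) -> dict:
    # Compress the raw file once as a baseline, then each
    # preprocessed variant with the same back-end settings.
    results = {"raw": len(zlib.compress(data, 9))}
    for name, pre in preprocessors.items():
        results[name] = len(zlib.compress(pre(data), 9))
    return results

# Example: an identity "preprocessor" must match the baseline exactly.
sizes = benchmark(b"some corpus bytes " * 500, {"identity": lambda b: b})
```

The same shape extends naturally to several corpora and several back-ends, which is all the nightly-benchmark idea really needs.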

  15. #13
    Member (California, joined Nov 2014)
    Added DRT results to previous post.

    Here on dickens:

    After preprocessing:

    -rw-rw-r-- 1 fred fred 10192446 Nov 23 09:02 /disk1/ws/lzma/dickens
    -rw-rw-r-- 1 fred fred 6225768 Nov 23 09:06 /disk1/ws/lzma/dickens.cmx
    -rw-rw-r-- 1 fred fred 5381646 Nov 23 08:55 /disk1/ws/lzma/dickens.dict
    -rw-rw-r-- 1 fred fred 6174584 Nov 23 08:54 /disk1/ws/lzma/dickens.drt
    -rw-rw-r-- 1 fred fred 7423169 Nov 23 08:56 /disk1/ws/lzma/dickens.f
    -rw-rw-r-- 1 fred fred 4572828 Nov 23 08:54 /disk1/ws/lzma/dickens.knz
    -rw-rw-r-- 1 fred fred 6178347 Nov 23 08:55 /disk1/ws/lzma/dickens.mcm
    -rw-rw-r-- 1 fred fred 3964126 Nov 23 08:55 /disk1/ws/lzma/dickens.xwrt


    After compression:

    -rw-rw-r-- 1 fred fred 2616433 Nov 23 09:06 dickens.cmx.lzma
    -rw-rw-r-- 1 fred fred 2612178 Nov 23 08:55 dickens.dict.lzma
    -rw-rw-r-- 1 fred fred 2614075 Nov 23 08:54 dickens.drt.lzma
    -rw-rw-r-- 1 fred fred 2713840 Nov 23 08:56 dickens.f.lzma
    -rw-rw-r-- 1 fred fred 2580741 Nov 23 08:54 dickens.knz.lzma
    -rw-rw-r-- 1 fred fred 2831189 Nov 23 09:02 dickens.lzma
    -rw-rw-r-- 1 fred fred 2658463 Nov 23 08:55 dickens.mcm.lzma
    -rw-rw-r-- 1 fred fred 2541032 Nov 23 08:55 dickens.xwrt.lzma


    xwrt is the winner here.

    Now, just to see what happens on non-text files.
    After preprocessing:

    -rw-rw-r-- 1 fred fred 4149414 Nov 4 2002 rafale.bmp
    -rw-rw-r-- 1 fred fred 4149433 Nov 23 09:21 rafale.cmx
    -rw-rw-r-- 1 fred fred 3468672 Nov 23 09:17 rafale.dict
    -rw-rw-r-- 1 fred fred 7202392 Nov 23 09:15 rafale.drt
    -rw-rw-r-- 1 fred fred 4149186 Nov 23 09:14 rafale.f
    -rw-rw-r-- 1 fred fred 4149436 Nov 23 09:14 rafale.knz
    -rw-rw-r-- 1 fred fred 4255197 Nov 23 09:14 rafale.mcm
    -rw-rw-r-- 1 fred fred 4149436 Nov 23 09:15 rafale.xwrt


    DRT does not have a text detector and expands the file significantly.
    Dict is the only one to compress this file, which means it is not purely focused on text; maybe more LZ-like?
    All the others (except mcm) keep the size similar to the original.

    After compression:


    -rw-rw-r-- 1 fred fred 976257 Nov 4 2002 rafale.bmp.lzma
    -rw-rw-r-- 1 fred fred 999271 Nov 23 09:21 rafale.cmx.lzma
    -rw-rw-r-- 1 fred fred 1041599 Nov 23 09:17 rafale.dict.lzma
    -rw-rw-r-- 1 fred fred 1076639 Nov 23 09:15 rafale.drt.lzma
    -rw-rw-r-- 1 fred fred 996625 Nov 23 09:14 rafale.f.lzma
    -rw-rw-r-- 1 fred fred 975986 Nov 23 09:14 rafale.knz.lzma
    -rw-rw-r-- 1 fred fred 1005676 Nov 23 09:14 rafale.mcm.lzma
    -rw-rw-r-- 1 fred fred 976249 Nov 23 09:15 rafale.xwrt.lzma


    Only kanzi and xwrt did not affect overall compression.
    Last edited by hexagone; 23rd November 2019 at 21:00. Reason: Fixed DRT results

  16. #14
    Shelwien (Administrator, Kharkov, Ukraine)
    I think you did something wrong with DRT (didn't unpack the dictionary?).
    "DRT.exe dickens dickens.drt" output is 6,174,584 bytes.

  17. Thanks:

    hexagone (23rd November 2019)

  18. #15
    Member (California, joined Nov 2014)
    hmmm. right.
    Dictionary in the wrong place. Let me update the results.

  19. #16
    Member (Argentina, joined Aug 2014)
    Here's what I got so far:

    198265    book1.cmix.ppmd
    208857 book1.dict.ppmd
    208897 book1.dict+lzp.ppmd
    208915 book1.rep+dict+lzp.ppmd
    209882 book1.xwrt.ppmd
    210223 book1.ppmd <=====
    236547 book1.cmix.7zf
    240632 book1.xwrt.7zf
    240960 book1.BWTS.knz.7zf
    241017 book1.BWT.knz.7zf
    242672 book1.TEXT.knz.7zf
    254416 book1.rep+dict+lzp.7zf
    254423 book1.dict+lzp.7zf
    254429 book1.dict.7zf
    263309 book1.RLT.knz.7zf
    263317 book1.7zf <=====



    2225956 dickens.cmix.ppmd
    2258411 dickens.xwrt.ppmd
    2273225 dickens.rep+dict+lzp.ppmd
    2274168 dickens.dict+lzp.ppmd
    2275704 dickens.TEXT.knz.ppmd
    2278051 dickens.dict.ppmd
    2351788 dickens.rep.ppmd
    2353958 dickens.lzp.ppmd
    2360830 dickens.ppmd <=====
    2519521 dickens.BWT.knz.7zf
    2520073 dickens.BWTS.knz.7zf
    2528298 dickens.xwrt.7zf
    2588021 dickens.TEXT.knz.7zf
    2626775 dickens.rep+dict+lzp.7zf
    2627901 dickens.dict+lzp.7zf
    2629002 dickens.dict.7zf
    2629682 dickens.cmix.7zf
    2863786 dickens.rep.7zf
    2863846 dickens.lzp.7zf
    2864794 dickens.7zf <=====



    22319831 enwik8.cmix.ppmd
    22541629 enwik8.xwrt.7zf
    22608777 enwik8.xwrt.ppmd
    22877749 enwik8.cmix.7zf
    22977039 enwik8.BWT.knz.7zf
    22977843 enwik8.BWTS.knz.7zf
    23140778 enwik8.TEXT.knz.ppmd
    23446494 enwik8.TEXT.knz.7zf
    23851132 enwik8.rep+dict+lzp.ppmd
    23896953 enwik8.dict+lzp.ppmd
    23908216 enwik8.dict.ppmd
    24182746 enwik8.lzp.ppmd
    24208741 enwik8.rep.ppmd
    24212985 enwik8.RLT.knz.ppmd
    24217656 enwik8.ppmd <=====
    24410126 enwik8.rep+dict+lzp.7zf
    24436741 enwik8.dict.7zf
    24438311 enwik8.dict+lzp.7zf
    25112239 enwik8.RLT.knz.7zf
    25112452 enwik8.7zf <=====



    I included nearly all of kanzi's transforms just to see what would happen. I'll delete the irrelevant ones later.

    *.ppmd is "fazip ppmd IN OUT"
    *.7zf is fastlzma2 max compression

    I will probably change this to lzbench, turbobench or some other when I have all the preprocessors included. This is just a proof of concept.

    CMIX is a clear winner so far if used in conjunction with ppmd. BWT is good for flzma2.

    I believe I should make a LIPTify dictionary for each major group: C++ code, Pascal code, HTML, etc.

    Some helpers are better applied together. As it is extremely impractical to try all permutations, I'm open to suggestions about which ones should be paired...
    Last edited by Gonzalo; 23rd November 2019 at 21:45. Reason: Added enwik8 results - Edited out the non-helpful entries

  20. #17
    Member (California, joined Nov 2014)
    Just to be clear, for kanzi only -t text is relevant. The other transforms do completely different things (e.g. SRT, RANK, or MTFT are used after a BWT, not in isolation).

  21. #18
    Member (Argentina, joined Aug 2014)
    Quote Originally Posted by hexagone View Post
    Just to be clear, for kanzi only -t text is relevant.
    Yet some of them are better suited for some files... See this:

    210241    book1.ZRLT.knz.ppmd
    210255 book1.RLT.knz.ppmd
    211340 book1.TEXT.knz.ppmd

    ....

    240960 book1.BWTS.knz.7zf
    241017 book1.BWT.knz.7zf
    242672 book1.TEXT.knz.7zf


    ...

    2519521 dickens.BWT.knz.7zf
    2520073 dickens.BWTS.knz.7zf
    2528298 dickens.xwrt.7zf
    2588021 dickens.TEXT.knz.7zf

    ...

    22977039 enwik8.BWT.knz.7zf
    22977843 enwik8.BWTS.knz.7zf
    23140778 enwik8.TEXT.knz.ppmd
    23446494 enwik8.TEXT.knz.7zf




    Which ones would you chain together? I'll add them to the next iteration
    Last edited by Gonzalo; 23rd November 2019 at 21:47. Reason: Added enwik8

  22. #19
    Member (California, joined Nov 2014)
    Look at the help command (-h) for the definitions of compression levels. But if you want to compare text preprocessors only, just use "-t text -e none", the rest is not relevant.

  23. #20
    Member (Argentina, joined Aug 2014)
    Quote Originally Posted by hexagone View Post
    Look at the help command (-h) for the definitions of compression levels. But if you want to compare text preprocessors only, just use "-t text -e none", the rest is not relevant.
    Yes, I did. I realize that -t TEXT is designed for text, but look at the numbers. BWT is always a better helper for flzma2, for example. That's why I still include it in the test.

    Are you kanzi's author? Maybe you could shed some light on which transforms work well together, like FA's dict+lzp, for example...

  24. #21
    Shelwien (Administrator, Kharkov, Ukraine)
    BWT is not a "helper"; it's a whole different compression algorithm, which is better than lzma for texts.
    But there's no point in compressing BWT output with lzma (there are specialized BWT postcoders which are both faster and better).
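To make the "sorting mechanism" point from the opening post concrete, here is a toy forward/inverse BWT (quadratic rotation sort, for illustration only; real implementations use suffix arrays and block splitting):

```python
def bwt(data: bytes):
    # Sort all rotations; output is the last column plus the row index
    # of the original string. Similar contexts end up adjacent, which
    # suits MTF/RLE-style postcoders better than LZ match-finding.
    n = len(data)
    rot = sorted(range(n), key=lambda i: data[i:] + data[:i])
    return bytes(data[(i - 1) % n] for i in rot), rot.index(0)

def ibwt(last: bytes, idx: int) -> bytes:
    # Classic inverse: pair each last-column byte with its row, sort
    # stably to recover the first column, then follow the links.
    table = sorted((c, j) for j, c in enumerate(last))
    out = bytearray()
    for _ in range(len(last)):
        c, idx = table[idx]
        out.append(c)
    return bytes(out)
```

The output is a permutation of the input, so BWT "compresses" nothing by itself; all the gain comes from the postcoder that follows.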

  25. #22
    Member (Argentina, joined Aug 2014)
    Quote Originally Posted by Shelwien View Post
    BWT is not a "helper"; it's a whole different compression algorithm, which is better than lzma for texts.
    But there's no point in compressing BWT output with lzma (there are specialized BWT postcoders which are both faster and better).
    Could those be MTF>>Block Sorting? In any case, even if it's not designed to be a 'helper', it does help...

  26. #23
    Member (California, joined Nov 2014)
    Of course you are free to compare whatever preprocessors work for your files. But BWT is a completely different beast and, for the record, comparing it to text-based preprocessors is comparing apples to oranges. So maybe you can split your study into #1 purely text preprocessors (there is a great list in this thread) and #2 any preprocessor that may improve compression of your files. That would be fairer than mixing all results in one list.
    As for transform sequences, you could try -t TEXT+BWT, -t TEXT+BWT+RANK+ZRLT, -t TEXT+BWT+SRT+ZRLT but the results also depend on the entropy coder you choose.

  27. Thanks:

    Gonzalo (23rd November 2019)

  28. #24
    Shelwien (Administrator, Kharkov, Ukraine)
    "block sorting" is BWT. MTF is "symbol ranking".
    BWT|MTF|EC is an old scheme popularized by bzip2, but its not especially good.
    By "specialized coders" I meant something like QLFC,DC or https://github.com/loxxous/Behemoth-Rank-Coding

    > even if it's not designed to be a 'helper', it does help...

    paq8px would also "help" in that sense.
    That is, book1.paq8px.lzma would still be much smaller than book1.lzma.
    But based on compression and speed, it's still better to use it on its own rather than as a preprocessor.
    Same applies to BWT.

  29. Thanks:

    Gonzalo (23rd November 2019)

  30. #25
    Member (Argentina, joined Aug 2014)
    So I took your advice and stripped away all of kanzi's transforms except TEXT. I also added Bulat's lzp after kanzi, xwrt, and cmix. Seems like a good idea:


    2217641 dickens.cmix.lzp.ppmd
    2225956 dickens.cmix.ppmd
    2253930 dickens.xwrt.lzp.ppmd
    2258411 dickens.xwrt.ppmd
    2272353 dickens.knz.lzp.ppmd
    2273225 dickens.rep+dict+lzp.ppmd
    2274168 dickens.dict+lzp.ppmd
    2275704 dickens.knz.ppmd
    2278051 dickens.dict.ppmd
    2351788 dickens.rep.ppmd
    2353958 dickens.lzp.ppmd
    2360830 dickens.ppmd
    2527744 dickens.xwrt.lzp.7zf
    2528298 dickens.xwrt.7zf
    2587338 dickens.knz.lzp.7zf
    2588013 dickens.knz.7zf
    2626775 dickens.rep+dict+lzp.7zf
    2627901 dickens.dict+lzp.7zf
    2629002 dickens.dict.7zf
    2629080 dickens.cmix.lzp.7zf
    2629682 dickens.cmix.7zf
    2863786 dickens.rep.7zf
    2863846 dickens.lzp.7zf
    2864794 dickens.7zf


    For book1, OTOH, it hurts compression, but only by ~20 to ~30 bytes

  31. Thanks:

    hexagone (24th November 2019)

  32. #26
    Member (Indonesia, joined Apr 2018)
    Quote Originally Posted by byronknoll View Post
    You can also try cmix -s dictionary/english.dic input output
    http://www.byronknoll.com/cmix.html

    you can also try paq8lab_2 -s6 input
    https://encode.su/threads/3239-paq8lab-1-0-archiver

