I'm looking for any text-oriented algorithm/program you may know of, specifically preprocessors designed to help general codecs like lzma, as I would like to test and compare them.
When I say text, I'm not necessarily referring to the textbook definition of a file produced by a human for another human, like book1 (it's not the 70's anymore and nobody saves a book in a .txt), but to the broader definition of a computer- or human-generated file consisting only of characters from a given alphabet, as opposed to a binary file which can contain any byte from 00 to FF. For example: source code, logs, XML, HTML, PDF content before compression (there's a lot of formatting there), etc. They can contain 'natural' human words and sentences as well as content in a formal computer language.
What I know as of today (and this is the part where somebody might want to correct me):
AFAIK there are at least 3 types of programs that can improve text compression:
* 'Dictionary' algorithms use a list of common words, either pre-compiled or computed on-the-fly from the file, and replace those words with shorter codes (a toy sketch follows this list). Example: XWRT
* 'LZP' is a Lempel-Ziv variant that uses context to predict matches
* and the Burrows-Wheeler Transform, which is actually a sorting mechanism that rearranges the data so it can be compressed better
* Did I miss something?
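To make the 'dictionary' idea concrete, here is a toy Python sketch (not XWRT or FA's dict, just an illustration): collect the most frequent words of the file, store them in a header, and replace each occurrence with a short escape-coded token, so the back-end compressor sees a shorter and more regular stream. The escape byte and the word pattern are arbitrary choices for the example.
Code:
# Toy word-replacement preprocessor (illustration only, not a real tool).
# Frequent words become short escape-coded tokens; the word list is stored
# in a header so a matching decoder could reverse the transform.
import collections, re, sys

ESC = b'\x01'  # escape byte, assumed absent from the input

def encode(data: bytes, dict_size: int = 256) -> bytes:
    words = re.findall(rb'[A-Za-z]{3,}', data)
    common = [w for w, _ in collections.Counter(words).most_common(dict_size)]
    table = {w: ESC + bytes([i]) for i, w in enumerate(common)}
    body = re.sub(rb'[A-Za-z]{3,}',
                  lambda m: table.get(m.group(0), m.group(0)), data)
    header = b'\n'.join(common)
    return len(header).to_bytes(4, 'big') + header + b'\n' + body

if __name__ == '__main__':
    sys.stdout.buffer.write(encode(sys.stdin.buffer.read()))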
Do you guys know of any particularly good implementation of any of those algorithms (or other ones)? It's imperative that they can be used as a standalone process (without an entropy coder), so their output can be fed to a general-purpose compressor like lzma. I would also prefer them in the form of source code, so I can compile them all with the same compiler, since I want to measure their performance.
* liptify and DRT are both crashing. Can't use them
* XWRT is one of the best, but sometimes it acts as if there were nothing to do with the file and outputs a slightly bigger .xwrt. There are other odd behaviors, especially when extracting, so I'm considering it unstable (and it's no longer developed)
* filter is of negligible help in my tests. I guess it's tuned for a particular file or something
* FA's dict and lzp are among the truly helpful ones, although it depends on the specific file.
* I can't compile mcm >>"2 warnings and 4 errors generated."
* kanzi's filters mostly don't seem to improve compression at all, and when they do, the gain is very small compared to FA's dict or lzp
* pcompress ships with a lot of filters and detection code, but I can't compile it either; it depends on an outdated library.
> liptify and DRT are both crashing. Can't use them
Can you preprocess a small English text (like book1)? Does it work?
liptify has a test script and I explained how to use DRT (there's a readme too).
I suspect that you just didn't get them to work, since there's no usage text
and there are two steps.
> * filter is of negligible help in my tests. I guess it's tuned for a particular file or something
FILTER.EXE f -b1024 BOOK1 BOOK1.f
converts BOOK1 from 768771 to 552973 bytes.
> * I can't compile mcm >>"2 warnings and 4 errors generated."
I've also got some errors. Since here we only need the preprocessor,
I just removed the problematic lines (replaced with "return 0") and it worked.
"mcm.exe -filter=dict -store BOOK1" generated file of size 509493.
What kind of files are you compressing? Do you have examples?
For pure English text and markup, see below:
Code:
liblzma 5.2.2
mcm 0.83
xwrt 3.4
cmix 18
-rw-rw-r-- 1 fred fred 252602 Nov 22 16:44 /disk1/ws/book1.dic.lzma
-rw-rw-r-- 1 fred fred 251794 Nov 22 17:04 /disk1/ws/book1.f.lzma
-rw-rw-r-- 1 fred fred 241663 Nov 22 17:16 /disk1/ws/book1.knz.lzma
-rw-rw-r-- 1 fred fred 261032 Jun 22 2013 /disk1/ws/book1.lzma
-rw-rw-r-- 1 fred fred 248257 Nov 22 17:10 /disk1/ws/book1.mcm.lzma
-rw-rw-r-- 1 fred fred 242697 Nov 22 17:31 /disk1/ws/book1.xwrt.lzma
-rw-rw-r-- 1 fred fred 234870 Nov 22 17:44 /disk1/ws/book1.cmx.lzma
-rw-rw-r-- 1 fred fred 234294 Nov 23 08:51 /disk1/ws/book1.drt.lzma
-rw-rw-r-- 1 fred fred 24773569 Nov 20 19:46 /disk1/ws/enwik8.dic.lzma
-rw-rw-r-- 1 fred fred 25322624 Nov 22 17:05 /disk1/ws/enwik8.f.lzma
-rw-rw-r-- 1 fred fred 24093239 Nov 22 17:09 /disk1/ws/enwik8.knz.lzma
-rw-rw-r-- 1 fred fred 26371635 Jun 29 13:56 /disk1/ws/enwik8.lzma
-rw-rw-r-- 1 fred fred 24362033 Nov 22 17:12 /disk1/ws/enwik8.mcm.lzma
-rw-rw-r-- 1 fred fred 23180188 Nov 22 17:31 /disk1/ws/enwik8.xwrt.lzma
-rw-rw-r-- 1 fred fred 23685080 Nov 22 17:44 /disk1/ws/enwik8.cmx.lzma
-rw-rw-r-- 1 fred fred 23744945 Nov 23 08:52 /disk1/ws/enwik8.drt.lzma
For kanzi, the default block size (1 MB) is too small for big files; give it a bigger one:
kanzi.exe -c -i r:\enwik8 -f -e none -b 100m -t text
enwik8: 100000000 => 51793684 bytes in 826 ms
book1: 768771 => 396684 bytes in 53 ms
I guess the results of the different text processors will depend on the type of input (pure text, mixed text/binary, English or not, UTF-8/ASCII, ...). For instance, xwrt wins on enwik8 because it is an XML processor, while the others are just for text AFAIK.
How about the compression/decompression speed?
@shelwien: Yup. I just assumed that if they gave me a
Code:
Segmentation fault (core dumped)
instead of at least exiting gracefully, they had to be faulty, so I didn't even read the docs past that point. My bad. They work as intended.
@byronknoll: Thanks! Didn't think of cmix. BTW: Is it supposed to run for a long time using "-s"? I was running it on a 50 MB file and I just cancelled it after several minutes. I was using a 'Colaboratory' machine so I didn't get the CPU usage. The size of input.cmix was 0 while running and 5 after cancelling.
@hexagone: I use a few files. enwik8 is one of them; enwik9 comes in when I finish the script. I also have 'sda4', which is the first 50000000 bytes (50 MB) of:
Code:
# assuming the source is the raw partition /dev/sda4
sudo cat /dev/sda4 | strings | head -c 50000000 > sda4
There is '7z', generated from the latest sources this way:
Code:
##pseudo (directory and file names are placeholders)
wget https://www.7-zip.org/a/7z1900-src.7z
7z x -osrc 7z1900-src.7z        # extract the sources
7z a -ttar src.tar ./src/*      # repack them as an uncompressed tar
strings src.tar > 7z            # keep only the printable strings
And I will probably add a few more or stop using some of these, depending on whether they are representative of real world data.
I will enlarge kanzi's block size and come back with hard data. Thanks for the tip!
>>How about the compression/decompression speed?
Not yet, but I plan to measure it once I have all the preprocessors up and running
>I have smaller textual files in my test set, and sometimes the preprocessor scores can differ from those for bigger files.
>For example R.DOC
Totally! That's why I'm writing a test script rather than just sharing my findings (I will when I finish)
I want everyone to be able to find out for themselves which preprocessor is better suited to their needs. It should be easy enough once it's finished. And I believe that if enough people use it, we could get an approximation of which ones are generally better than the rest. An ultimate version could even run as a nightly benchmark and use the collected data to find subgroups within the text group. For example, one method could be best for C++ code, another one for precomped PDFs. Anyway, we're too far away from that now. Just keep throwing programs at me, and I'll keep adding them to the test.
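For reference, the script is roughly shaped like this minimal Python sketch. The tool names and command templates here are placeholders, not the real command lines, and xz stands in for the lzma back-end.
Code:
# Minimal benchmark loop: run each preprocessor, compress its output with a
# fixed back-end, and record the sizes. Command templates are placeholders.
import os, subprocess

PREPROCESSORS = {                       # name -> command template (hypothetical)
    'none': None,
    'dict': ['dict_tool', '{src}', '{dst}'],
    'xwrt': ['xwrt_tool', '{src}', '{dst}'],
}
BACKEND = ['xz', '-9', '-k', '-f']      # lzma back-end via xz

def run(corpus_file):
    for name, tmpl in PREPROCESSORS.items():
        pre = corpus_file if tmpl is None else corpus_file + '.' + name
        if tmpl is not None:
            subprocess.run([a.format(src=corpus_file, dst=pre) for a in tmpl],
                           check=True)
        subprocess.run(BACKEND + [pre], check=True)      # produces pre + '.xz'
        print(name, os.path.getsize(pre + '.xz'))

if __name__ == '__main__':
    run('book1')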
@all: Should I write the test and then release it, or would you prefer to have an early version available for editing?
-rw-rw-r-- 1 fred fred 10192446 Nov 23 09:02 /disk1/ws/lzma/dickens
-rw-rw-r-- 1 fred fred 6225768 Nov 23 09:06 /disk1/ws/lzma/dickens.cmx
-rw-rw-r-- 1 fred fred 5381646 Nov 23 08:55 /disk1/ws/lzma/dickens.dict
-rw-rw-r-- 1 fred fred 6174584 Nov 23 08:54 /disk1/ws/lzma/dickens.drt
-rw-rw-r-- 1 fred fred 7423169 Nov 23 08:56 /disk1/ws/lzma/dickens.f
-rw-rw-r-- 1 fred fred 4572828 Nov 23 08:54 /disk1/ws/lzma/dickens.knz
-rw-rw-r-- 1 fred fred 6178347 Nov 23 08:55 /disk1/ws/lzma/dickens.mcm
-rw-rw-r-- 1 fred fred 3964126 Nov 23 08:55 /disk1/ws/lzma/dickens.xwrt
Post compression:
-rw-rw-r-- 1 fred fred 2616433 Nov 23 09:06 dickens.cmx.lzma
-rw-rw-r-- 1 fred fred 2612178 Nov 23 08:55 dickens.dict.lzma
-rw-rw-r-- 1 fred fred 2614075 Nov 23 08:54 dickens.drt.lzma
-rw-rw-r-- 1 fred fred 2713840 Nov 23 08:56 dickens.f.lzma
-rw-rw-r-- 1 fred fred 2580741 Nov 23 08:54 dickens.knz.lzma
-rw-rw-r-- 1 fred fred 2831189 Nov 23 09:02 dickens.lzma
-rw-rw-r-- 1 fred fred 2658463 Nov 23 08:55 dickens.mcm.lzma
-rw-rw-r-- 1 fred fred 2541032 Nov 23 08:55 dickens.xwrt.lzma
xwrt is the winner here.
Now, just to see what happens on non-text files.
Post processing:
-rw-rw-r-- 1 fred fred 4149414 Nov 4 2002 rafale.bmp
-rw-rw-r-- 1 fred fred 4149433 Nov 23 09:21 rafale.cmx
-rw-rw-r-- 1 fred fred 3468672 Nov 23 09:17 rafale.dict
-rw-rw-r-- 1 fred fred 7202392 Nov 23 09:15 rafale.drt
-rw-rw-r-- 1 fred fred 4149186 Nov 23 09:14 rafale.f
-rw-rw-r-- 1 fred fred 4149436 Nov 23 09:14 rafale.knz
-rw-rw-r-- 1 fred fred 4255197 Nov 23 09:14 rafale.mcm
-rw-rw-r-- 1 fred fred 4149436 Nov 23 09:15 rafale.xwrt
DRT does not have a text detector and expands the file significantly.
Dict is the only one that compresses this file, which means it is not purely focused on text; maybe it is more LZ-like?
All the others (except mcm) keep the size similar to the original.
Post compression:
-rw-rw-r-- 1 fred fred 976257 Nov 4 2002 rafale.bmp.lzma
-rw-rw-r-- 1 fred fred 999271 Nov 23 09:21 rafale.cmx.lzma
-rw-rw-r-- 1 fred fred 1041599 Nov 23 09:17 rafale.dict.lzma
-rw-rw-r-- 1 fred fred 1076639 Nov 23 09:15 rafale.drt.lzma
-rw-rw-r-- 1 fred fred 996625 Nov 23 09:14 rafale.f.lzma
-rw-rw-r-- 1 fred fred 975986 Nov 23 09:14 rafale.knz.lzma
-rw-rw-r-- 1 fred fred 1005676 Nov 23 09:14 rafale.mcm.lzma
-rw-rw-r-- 1 fred fred 976249 Nov 23 09:15 rafale.xwrt.lzma
Only kanzi and xwrt did not affect overall compression.
I included all but one of kanzi's helpers, just to see what would happen. I'll delete the irrelevant ones later.
*.ppmd is "fazip ppmd IN OUT"
*.7zf is fastlzma2 max compression
I will probably change this to lzbench, turbobench or some other when I have all the preprocessors included. This is just a proof of concept.
CMIX is a clear winner so far if used in conjunction with ppmd. BWT is good for flzma2.
I believe I should make a LIPTify dictionary for each major group: C++ code, Pascal code, HTML, etc.
Some helpers work better applied together. As it is extremely impractical to try all permutations, I'm open to suggestions about which ones should be paired...
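For anyone who wants to play with pairs, this is how I picture the chaining: feed the output of one helper into the next and only then hand it to the back-end. A hedged Python sketch; the command templates are placeholders for the real tools and xz stands in for the back-end.
Code:
# Chain preprocessors by feeding each one's output to the next, then compress.
# Command templates are placeholders, not real CLIs.
import itertools, subprocess

TOOLS = {
    'dict': ['dict_tool', '{src}', '{dst}'],
    'lzp':  ['lzp_tool',  '{src}', '{dst}'],
    'xwrt': ['xwrt_tool', '{src}', '{dst}'],
}

def chain(corpus_file, names):
    cur = corpus_file
    for n in names:
        out = cur + '.' + n
        subprocess.run([a.format(src=cur, dst=out) for a in TOOLS[n]], check=True)
        cur = out
    subprocess.run(['xz', '-9', '-k', '-f', cur], check=True)
    return cur + '.xz'

# Even with only 3 tools, trying every ordered pair is already 6 runs per file:
for pair in itertools.permutations(TOOLS, 2):
    print(pair, chain('book1', pair))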
Just to be clear, for kanzi only -t text is relevant. Other transforms do completely different things (e.g. SRT, RANK or MTFT are used after a BWT, not in isolation).
Look at the help command (-h) for the definitions of compression levels. But if you want to compare text preprocessors only, just use "-t text -e none", the rest is not relevant.
> Look at the help command (-h) for the definitions of compression levels. But if you want to compare text preprocessors only, just use "-t text -e none", the rest is not relevant.
Yes, I did. I realize that -t TEXT is designed for text, but look at the numbers. BWT is always a better helper for flzma2, for example. That's why I still include it in the test.
Are you kanzi's author? Maybe you could shed some light on which transforms work well together, like FA's dict+lzp, for example...
BWT is not a "helper", it's a whole different compression algorithm, which is better than lzma for text.
But there's no point in compressing BWT output with lzma (there are specialized BWT postcoders which are both faster and better).
> BWT is not a "helper", it's a whole different compression algorithm, which is better than lzma for text.
> But there's no point in compressing BWT output with lzma (there are specialized BWT postcoders which are both faster and better).
Could those be MTF>>Block Sorting? In any case, even if it's not designed to be a 'helper', it does help...
Of course you are free to compare whatever preprocessors work for your files. But BWT is a completely different beast and, for the record, comparing it to text-based preprocessors is comparing apples to oranges. So maybe you can split your study into #1 purely text preprocessors (there is a great list in this thread) and #2 any preprocessors that may improve compression of your files. That would be fairer than mixing all the results in one list.
As for transform sequences, you could try -t TEXT+BWT, -t TEXT+BWT+RANK+ZRLT, -t TEXT+BWT+SRT+ZRLT but the results also depend on the entropy coder you choose.
"block sorting" is BWT. MTF is "symbol ranking".
BWT|MTF|EC is an old scheme popularized by bzip2, but it's not especially good.
By "specialized coders" I meant something like QLFC, DC or https://github.com/loxxous/Behemoth-Rank-Coding
> even if it's not designed to be a 'helper', it does help...
paq8px would also "help" in that sense.
That is, book1.paq8px.lzma would still be much smaller than book1.lzma.
But based on compression and speed, it's still better to use it on its own rather than as a preprocessor.
Same applies to BWT.
So I took your advice and stripped away all of kanzi's transforms except TEXT. I also added Bulat's lzp after kanzi, xwrt and cmix. Seems like a good idea: