I just posted a write-up of a small research project on pre-processing Unicode text. The pre-processing technique is a matrix transpose (so nothing new). Nevertheless, it's interesting to see how much it helps, particularly for ZPAQ and RAR.
I once added this kind of filter in paq8pxd: https://github.com/kaitz/paq8pxd/tre...75619c70d4e994
Can't remember what effect it had. https://encode.su/threads/1464-Paq8p...ll=1#post28037
KZo
Brotli has a simple UTF-8 compatible context model. Have you considered doing the optimization through context modeling instead of transposing?
I see your point. It's out of scope for my project. I've been writing a tool for recognizing patterns in binary data, so I thought it could be useful in the compression field as well as in other fields such as digital forensics.
I ran some tests with Brotli on the same samples that are mentioned in the blog post. It seems Brotli also benefits from the pre-processing on these two isolated samples.
7-Zip 17.01 ZS v1.3.2 R1
Compression method: Brotli
Compression level: Level 11 (Ultra)
Real-world Data Results:
1,828,539 file.packed.brotli.7z
1,577,959 file.transformed.packed.brotli.7z
Random Data Results:
39,910,585 unicode.packed.brotli.7z
34,418,513 unicode.transformed.packed.brotli.7z
Jyrki Alakuijala (11th February 2018)
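To put the Brotli figures above in relative terms, here is a quick calculation of the size reduction for each pair. The numbers are copied from the listing; the helper name `saving_percent` is only illustrative, and this is just arithmetic on the reported sizes, not a new measurement.

```python
def saving_percent(plain_size: int, transformed_size: int) -> float:
    """Relative size reduction, in percent, of the transformed archive."""
    return 100.0 * (plain_size - transformed_size) / plain_size


# Archive sizes as reported in the listing above.
results = {
    "Real-world data": (1_828_539, 1_577_959),
    "Random data": (39_910_585, 34_418_513),
}

for name, (plain, transformed) in results.items():
    print(f"{name}: {saving_percent(plain, transformed):.1f}% smaller")
```

Both pairs come out at a little under 14%.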
You might try the nibble transpose in the more general TurboTranspose. For your UTF-16 files, you can set the element size (esize) to 2 in the call to tp4enc.
The byte transpose in your experiment corresponds to the "tpenc" function with esize = 2.
xinix (11th February 2018)
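For readers who want to see what a byte transpose with an element size of 2 does to UTF-16 data, here is a minimal pure-Python sketch. It is not TurboTranspose's code and makes no claim about how that library works internally; the function names `byte_transpose` and `byte_untranspose` are made up for this example, and the sample text and the use of zlib are assumptions chosen only to make the script self-contained.

```python
import zlib


def byte_transpose(data: bytes, esize: int = 2) -> bytes:
    """Gather byte k of every esize-byte element into one contiguous run."""
    if len(data) % esize:
        raise ValueError("data length must be a multiple of esize")
    return b"".join(data[k::esize] for k in range(esize))


def byte_untranspose(data: bytes, esize: int = 2) -> bytes:
    """Inverse of byte_transpose: re-interleave the esize runs."""
    if len(data) % esize:
        raise ValueError("data length must be a multiple of esize")
    n = len(data) // esize  # number of elements
    out = bytearray(len(data))
    for k in range(esize):
        out[k::esize] = data[k * n:(k + 1) * n]
    return bytes(out)


# Illustrative sample: Cyrillic text stored as UTF-16 (little endian),
# so each character is one 2-byte element.
raw = ("Привет, мир! " * 200).encode("utf-16-le")
transposed = byte_transpose(raw, esize=2)

# The transform is lossless.
assert byte_untranspose(transposed, esize=2) == raw

# Compare how a general-purpose compressor handles both layouts.
print("plain     :", len(zlib.compress(raw, 9)))
print("transposed:", len(zlib.compress(transposed, 9)))
```

With esize = 2, all low bytes of the UTF-16 code units end up in the first half of the output and all high bytes in the second half, which is the matrix transpose described in the first post applied to a two-column matrix.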
How fast is the 'untranspose' phase?
In-memory transpose with a data size of 82,097,412 fluctuates around these values:
Transpose: 00:00:00.5122561
Untranspose: 00:00:00.4562677
Note that my research pre-processor runs on .NET Framework 4.