I'm not sure... Should I get MinGW to compile it?
Yes, it seems to compile like this with g.bat: http://nishi.dreamhosters.com/u/paq8pxd75_src_0.7z
You might have to change "set gcc=C:\MinGW820x\bin\g++.exe" to your MinGW path, and "-march=k8" to "-march=native" for speed.
I cannot download the file.. Could you compile this version?
Yes, there's an exe inside too.
Darek (3rd March 2020)
@Shelwien: Did you change something in this version? So far the enwik8 scores are the same as for paq8pxd v75.
I didn't change anything, just added script to compile it.
Since you're testing it, I thought you could try finding optimal mod_ppmd parameters (order and memory).
Darek (4th March 2020)
True, enwik8 -s8 is also worse (if only the wordmodel is considered); it just compresses better as a whole.
At the moment it has arbitrary settings. It was meant to be at the same (total) memory usage level as px when last used. It really depends on what the input is, and so on.
In version 34 there was one more https://github.com/kaitz/paq8pxd/blo...8pxd.cpp#L6483
E: also to limit memory usage below 30GB at the time.
Last edited by kaitz; 5th March 2020 at 20:23. Reason: mem
KZo
I tried to compress enwik10 with paq8pxd_v75_AVX2, but it never finishes.
Console:
paq8pxd_v75_avx2 -s15 enwik10
Creating archive enwik10.paq8pxd75 with 1 file(s)...
File list (21 bytes)
Compressed from 21 to 18 bytes.
1/1 Filename: enwik10 (1410065408 bytes)
Block segmentation:
0 | default |2147483646 [0 - 4186751275]
1 | default | 2 [2147483646 - 2147483647]
2 | text | 135946 [2147483648 - 2147619593]
3 | default |2147483646 [2147619594 - 4186751273]
4 | default | 1 [135944 - 135944]
5 | text | 135949 [135945 - 271893]
6 | default |2147483646 [271894 - 1410065406]
7 | ARM |1518281428 [4186751276 - 1410065407]
Task manager:
paq8pxd_v75_AVX2.exe (still running):
CPU time: 302:33:06 (12.6 days)
Peak working memory: 393,940K
I/O read: 38,365,960,503 bytes
I/O write: 12,515,576,592 bytes
Files (both created 12.5+ days ago):
enwik10.paq8pxd75 0 bytes
tmpBAC1.tmp 7,960,731,648 bytes
I accidentally noticed that all paq8pxd versions >10 cannot compress or decompress (they just crash) on my AMD Athlon II X4 640 processor (SSE4, 2010).
Versions <7 work fine, and all paq8px versions work fine too.
I will look into this next week. It should be detected as text or default. The problem is that there is a hard split at 2GB, as seen in the log. Detection should also be over within 40-55 minutes, which did not happen in this case.
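The numbers in the log are consistent with 32-bit wraparound: assuming enwik10 is exactly 10^10 bytes, the size shown in the log (1410065408) is what remains after keeping only the low 32 bits. A minimal sketch of that arithmetic follows; the helper names are hypothetical and this is not paq8pxd's actual detection code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: the value left after a 64-bit size is stored in an
// unsigned 32-bit field (only the low 32 bits survive).
inline std::uint32_t truncate_to_u32(std::uint64_t size) {
    return static_cast<std::uint32_t>(size);
}

// Hypothetical helper: number of blocks needed when each block is capped
// at `cap` bytes (rounded up).
inline std::uint64_t blocks_needed(std::uint64_t size, std::uint64_t cap) {
    assert(cap != 0);
    return (size + cap - 1) / cap;
}

// Assumed sizes: enwik10 as 10^10 bytes, and a 2 GiB (2^31 bytes) block cap.
const std::uint64_t kEnwik10 = 10000000000ULL;
const std::uint64_t kTwoGiB  = 2147483648ULL;
```

With these assumptions, truncate_to_u32(kEnwik10) gives 1410065408, matching the "Filename: enwik10" line, and blocks_needed(kEnwik10, kTwoGiB) gives 5, so a correct segmentation would need at least five blocks under a 2 GiB cap.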
I think that starting from version 10 and up I used SSE4 as the main target. Maybe this AMD CPU lacks proper support for it, or supports it only partially? (http://www.cpu-world.com/CPUs/K10/AM...0WFGMBOX).html) Maybe compile it from source.
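As far as I know, K10-era Athlon II chips implement SSE4a (an AMD-specific extension) but not SSE4.1/SSE4.2, which would explain a crash in a binary built with SSE4 as the target. A minimal sketch of a runtime check using the GCC/Clang built-ins is below; the build labels returned by pick_build are hypothetical and not names used by paq8pxd.

```cpp
#include <cassert>
#include <string>

// True if the running CPU reports SSE4.1. Uses GCC/Clang x86 built-ins;
// on other compilers or architectures it conservatively returns false.
inline bool cpu_has_sse41() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse4.1") != 0;
#else
    return false;
#endif
}

// Same check for AVX2.
inline bool cpu_has_avx2() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") != 0;
#else
    return false;
#endif
}

// Hypothetical policy: pick the most capable build the CPU can run.
// The labels are illustrative only.
inline std::string pick_build(bool sse41, bool avx2) {
    if (avx2)  return "avx2";
    if (sse41) return "sse4";
    return "generic";
}
```

Calling pick_build(cpu_has_sse41(), cpu_has_avx2()) at startup would let a launcher refuse to run, or fall back to a generic build, instead of crashing on an illegal instruction.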
https://encode.su/threads/342-paq8px...ll=1#post64174
In todo list :)
I also got this wrton working. Some things were compiler dependent (I have gcc v8.1 and v4.9), and there was also a wrong order of variable initialization (which had no effect on the older compiler).
This allowed me to merge the better pdf compression. On reymont it was something like 14kb better.
Just need some free time to think about this. :)
KZo
Progress so far.
Code:
7689823266 pxd -s0 (time 1 hour)
1865350519 7z (time 1,5 hours)
There are some stupid strings in the dict, like:
Code:
000–15
1815–1816
1815–1817
1815–1818
1815–1824
1815–1830
– is a 3-byte utf8 char and is treated as such.
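The en dash (U+2013) in those dictionary strings is encoded in UTF-8 as the three bytes E2 80 93, so a byte-oriented tokenizer sees it as part of the "word" unless it is split off. A minimal sketch of splitting such a string into digit runs and whole UTF-8 characters is below. It only illustrates the idea of separating numbers from utf8 chars; the function names are hypothetical and this is not the wrt code from paq8pxd.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Length in bytes of a UTF-8 sequence, judged from its lead byte.
// Returns 0 for continuation bytes (10xxxxxx) and invalid lead bytes.
inline std::size_t utf8_seq_len(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Splits a string into tokens: each multi-byte UTF-8 character becomes its
// own token, and single-byte text is grouped into digit / non-digit runs.
inline std::vector<std::string> split_num_utf8(const std::string& s) {
    std::vector<std::string> out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t n = utf8_seq_len(c);
        if (n > 1) {
            if (i + n > s.size()) n = s.size() - i;  // clamp a truncated sequence
            out.push_back(s.substr(i, n));
            i += n;
        } else {
            // Extend the run while the digit/non-digit class stays the same
            // and no multi-byte lead byte is reached.
            bool digit = std::isdigit(c) != 0;
            std::size_t j = i + 1;
            while (j < s.size()) {
                unsigned char d = static_cast<unsigned char>(s[j]);
                if (utf8_seq_len(d) > 1 || (std::isdigit(d) != 0) != digit) break;
                ++j;
            }
            out.push_back(s.substr(i, j - i));
            i = j;
        }
    }
    return out;
}
```

For the input "1815–1816" this yields three tokens: "1815", the 3-byte en dash, and "1816".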
Overall there are about 310000 words. Some utf8 chars are at the beginning and most single utf8 chars are at the end of the dict.
For 10GB there is about 75GB of reads and 45GB of writes (detect, transform, compare, compress/copy, final archive). What a waste :D
KZo
paq8pxd_v76
Code:
- Change wordModel1 to compress pdf text (from paq8px_183fix1)
- Fix jpeg thumbnail compression
- Make online wrt work
- In wrt split num/utf8 chars, also some other utf8 chars. Large file mode
- Allow large text block detection (+2GB)
- Set utf8 for text if found
- Change wordModel1, recordmodel to use wrt column mode
- Change sparsemodelx
- Small fixes
- Show progress when detecting data
This wrt column mode is really helpful, mostly in wordmodel.
I wanted to use it a long time ago. It probably only breaks on utf8 char columns. I expect more improvements.
dickens is about 10kb better with -s8
EDIT:
I uploaded v77 to git. dickens is 1kb better vs v76, and my current test is another 1kb better vs v77.
Last edited by kaitz; 19th March 2020 at 21:54.
KZo
@kaitz - where did you upload the v77 version?
Darek (20th March 2020)
Hmmm, it looks like there are indeed improvements in textual files, but other types of files have some drawbacks.
In total my testset got a 29KB worse score (0.3%). Here are the scores of my testset for paq8pxd v76.
kaitz (21st March 2020)
enwik8:
16,314,392 bytes, 5,782.234 sec. paq8pxd v76 -s8
16,316,789 bytes, 5,817.031 sec. paq8pxd v77 -s8
15,965,102 bytes, 5,904.337 sec. paq8pxd v76 -s15
15,967,512 bytes, 5,933.575 sec. paq8pxd v77 -s15
Last edited by Sportman; 21st March 2020 at 02:16.
Scores of 4 corpuses for paq8pxd v76 and v77. Despite the worse scores on my testset, both paq8pxd v76 and v77 got the best scores on all 4 corpuses, with very good improvements!
For the Silesia corpus, paq8pxd v77 is 123KB smaller than paq8pxd v75!
kaitz (23rd March 2020)
some enwik scores gathered:
16'319'686 - enwik8 -s8 by Paq8pxd_v75_AVX2
15'976'838 - enwik8 -s15 by Paq8pxd_v75_AVX2
16'260'265 - enwik8 -x8 by Paq8pxd_v75_AVX2
15'912'509 - enwik8 -x15 by Paq8pxd_v75_AVX2
15'859'187 - enwik8.drt -x15 by Paq8pxd_v75_AVX2
125'761'484 - enwik9_1423 -x15 by Paq8pxd_v75_AVX2
126'074'749 estimated - enwik9_1423.drt -x15 by Paq8pxd_v75_AVX2
16'314'392 - enwik8 -s8 by Paq8pxd_v76_AVX2 - tested by Sportman
15'965'102 - enwik8 -s15 by Paq8pxd_v76_AVX2 - tested by Sportman
16'253'017 - enwik8 -x8 by Paq8pxd_v76_AVX2
15'899'380 - enwik8 -x15 by Paq8pxd_v76_AVX2
15'856'800 - enwik8.drt -x15 by Paq8pxd_v76_AVX2
16'316'789 - enwik8 -s8 by Paq8pxd_v77_AVX2 - tested by Sportman
15'967'512 - enwik8 -s15 by Paq8pxd_v77_AVX2 - tested by Sportman
16'255'214 - enwik8 -x8 by Paq8pxd_v77_AVX2
15'901'484 - enwik8 -x15 by Paq8pxd_v77_AVX2 - tested by Kaitz
15'856'824 - enwik8.drt -x15 by Paq8pxd_v77_AVX2
125'65x'xxx estimated - enwik9_1423 -x15 by Paq8pxd_v76_AVX2
kaitz (24th March 2020)
paq8pxd_v78
Code:
- Change wordModel1, recordmodel
l.pak, k.wad not fixed for now.
This change will mostly work only with the internal wrt. Files processed with drt will not benefit from it. Most of the gain is on plain text files and comes from wordmodel.
enwik8 -s8 should be 19kb smaller.
KZo
Darek (24th March 2020),Mike (24th March 2020),moisesmcardona (24th March 2020),Sportman (29th March 2020)
enwik8/9 scores for paq8pxd_v76:
15'928'916 - enwik8 -x15 by Paq8pxd_v74_AVX2
125'752'479 - enwik9_1423 -x15 by Paq8pxd_v74_AVX2
15'912'509 - enwik8 -x15 by Paq8pxd_v75_AVX2
125'761'484 - enwik9_1423 -x15 by Paq8pxd_v75_AVX2
15'899'380 - enwik8 -x15 by Paq8pxd_v76_AVX2
125'974'773 - enwik9_1423 -x15 by Paq8pxd_v76_AVX2 - hmmm, there is a 0.17% loss vs. the v75 version and 0.18% vs. the v74 version. The v74 is still the best!
paq8pxd v77 and v78 tests ongoing.
paq8pxd_v78 scores on my testset. In general no big changes. Some improvements for textual files. Some losses for bigger files.
kaitz (26th March 2020)
paq8pxd_v78 scores for 4 corpuses => another version with all 4 records for the paq8pxd series!
kaitz (26th March 2020)
First enwik scores:
16'319'686 - enwik8 -s8 by Paq8pxd_v75_AVX2
16'314'392 - enwik8 -s8 by Paq8pxd_v76_AVX2 = -5'300 bytes
16'316'789 - enwik8 -s8 by Paq8pxd_v77_AVX2 = +2'400 bytes
16'291'281 - enwik8 -s8 by Paq8pxd_v78_AVX2 = -25'500 bytes -> good improvement!
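The per-version differences can be recomputed directly from the sizes listed above. A minimal sketch, with a hypothetical Result struct and helper names, that computes the signed change against the previous version and a relative change in percent:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One benchmark row: version number and compressed size in bytes.
struct Result {
    int version;
    std::int64_t bytes;
};

// Signed size change of each result relative to the one before it
// (negative means the newer version compresses better).
inline std::vector<std::int64_t> deltas(const std::vector<Result>& r) {
    std::vector<std::int64_t> out;
    for (std::size_t i = 1; i < r.size(); ++i)
        out.push_back(r[i].bytes - r[i - 1].bytes);
    return out;
}

// Relative change in percent from `before` to `after`.
inline double percent_change(std::int64_t before, std::int64_t after) {
    assert(before != 0);
    return 100.0 * static_cast<double>(after - before) / static_cast<double>(before);
}

// enwik8 -s8 sizes for v75..v78, copied from the list above.
inline std::vector<Result> enwik8_s8() {
    return {{75, 16319686}, {76, 16314392}, {77, 16316789}, {78, 16291281}};
}
```

Applied to enwik8_s8(), deltas gives -5294, +2397 and -25508 bytes for v76, v77 and v78 respectively.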
Other enwik8 scores:
16'316'789 - enwik8 -s8 by Paq8pxd_v77_AVX2
15'967'512 - enwik8 -s15 by Paq8pxd_v77_AVX2
16'255'214 - enwik8 -x8 by Paq8pxd_v77_AVX2
15'901'484 - enwik8 -x15 by Paq8pxd_v77_AVX2
15'856'824 - enwik8.drt -x15 by Paq8pxd_v77_AVX2
16'291'281 - enwik8 -s8 by Paq8pxd_v78_AVX2
15'941'450 - enwik8 -s15 by Paq8pxd_v78_AVX2
16'231'687 - enwik8 -x8 by Paq8pxd_v78_AVX2
15'877'659 - enwik8 -x15 by Paq8pxd_v78_AVX2
15'852'312 - enwik8.drt -x15 by Paq8pxd_v78_AVX2 - drt got a smaller improvement than the plain file; however, it still gives the best score ever for the paq8pxd series!
enwik9 estimate = 125'802'xxx - very close to paq8pxd v74!
125'752'479 - enwik9_1423 -x15 by Paq8pxd_v74_AVX2
125'797'519 - enwik9_1423 -x15 by Paq8pxd_v78_AVX2 - slightly worse than paq8pxd v74
kaitz (29th March 2020)
I found a suspicious thing:
bufn.setsize(0x10000);
if (level>=9) buf.setsize(0x10000000); //limit 256mb
else buf.setsize(MEM()*8);
Do I read it right that paq8pxd uses a 256mb buffer for enwik9 here?
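For reference, 0x10000000 is 268,435,456 bytes, i.e. exactly 256 MiB, so the quoted branch caps the buffer at that size for level >= 9. A minimal sketch of the arithmetic is below; mem_bytes stands in for MEM(), whose definition is not shown here, and both helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

const std::uint64_t kMiB = 1024ULL * 1024ULL;

// Mirrors the quoted branch. `mem_bytes` is a stand-in for MEM(),
// which is not defined in the snippet above.
inline std::uint64_t buf_size(int level, std::uint64_t mem_bytes) {
    if (level >= 9) return 0x10000000ULL;  // 268,435,456 bytes = 256 MiB
    return mem_bytes * 8;
}

// How many buffer-lengths an input of `input` bytes spans, rounded up.
inline std::uint64_t buffer_spans(std::uint64_t input, std::uint64_t buf) {
    assert(buf != 0);
    return (input + buf - 1) / buf;
}
```

Taking enwik9 as 10^9 bytes, buffer_spans(1000000000, buf_size(15, 0)) is 4: the file is roughly 3.7 times larger than the buffer, so most of it cannot be held in the buffer at once.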
kaitz (28th March 2020)
paq8pxd_v79
Code:
- Change wordModel1 some html entities rollback
- Some fixes
enwik8 -s8 is about 18kb smaller than v78.
KZo
Darek (29th March 2020)