Stephan, thanks for reporting.
I have time on next weekend. I hope.
KZo
Stephan Busch (26th February 2018)
@Kaitz - will you publish v47, or is it internal only?
Darek (2nd March 2018)
This is my .tar of Camera Raw testset:
https://drive.google.com/open?id=11u...jF5EywuZjTNw53
D:\TESTSETS>paq8pxd_v46_speed -s0 c.tar
Creating archive c.tar.paq8pxd46 with 1 file(s)...
File list (17 bytes)
Compressed from 17 to 22 bytes.
1/1 Filename: c.tar (577854464 bytes)
Block segmentation:
0 | default | 80496 [0 - 80495]
1 | jpeg | 12388 [80496 - 92883]
2 | jpeg | 2077017 [92884 - 2169900]
3 | default | 24855167 [2169901 - 27025067]
4 | jpeg | 18874 [27025068 - 27043941]
5 | default | 2 [27043942 - 27043943]
6 | jpeg | 3553878 [27043944 - 30597821]
7 | default | 22929426 [30597822 - 53527247]
8 | jpeg | 16966 [53527248 - 53544213]
9 | default | 2 [53544214 - 53544215]
10 | jpeg | 2795912 [53544216 - 56340127]
11 | default | 22047724 [56340128 - 78387851]
12 | jpeg | 8406 [78387852 - 78396257]
13 | default | 19891506 [78396258 - 98287763]
14 | jpeg | 779708 [98287764 - 99067471]
15 | bintext | 18989124 [99067472 - 118056595]
16 | jpeg | 627967 [118056596 - 118684562]
17 | default | 13245477 [118684563 - 131930039]
18 | jpeg | 2795912 [131930040 - 134725951]
19 | default | 9477824 [134725952 - 144203775]
20 | hdr | 3936 [144203776 - 144207711]
21 | 24b-image | 230400 [144207712 - 144438111] (width: 960)
22 | bintext | 10359712 [144438112 - 154797823]
23 | jpeg | 7573 [154797824 - 154805396]
24 | default | 18446187 [154805397 - 173251583]
25 | jpeg | 797378 [173251584 - 174048961]
26 | default | 59198 [174048962 - 174108159]
27 | jpeg | 1016655 [174108160 - 175124814]
28 | default | 10647881 [175124815 - 185772695]
29 | jpeg | 102457 [185772696 - 185875152]
30 | default | 122671 [185875153 - 185997823]
31 | jpeg | 938643 [185997824 - 186936466]
32 | default | 44397 [186936467 - 186980863]
33 | jpeg | 1290134 [186980864 - 188270997]
34 | default | 17769578 [188270998 - 206040575]
35 | jpeg | 103522 [206040576 - 206144097]
36 | default | 59294 [206144098 - 206203391]
37 | jpeg | 932367 [206203392 - 207135758]
38 | default | 497 [207135759 - 207136255]
39 | jpeg | 3004310 [207136256 - 210140565]
40 | default | 24004682 [210140566 - 234145247]
41 | jpeg | 8194 [234145248 - 234153441]
42 | bintext | 23070 [234153442 - 234176511]
43 | jpeg | 990722 [234176512 - 235167233]
44 | default | 15160158 [235167234 - 250327391]
45 | jpeg | 9032 [250327392 - 250336423]
46 | default | 23384 [250336424 - 250359807]
47 | jpeg | 877947 [250359808 - 251237754]
48 | default | 15224869 [251237755 - 266462623]
49 | jpeg | 8366 [266462624 - 266470989]
50 | default | 186290 [266470990 - 266657279]
51 | jpeg | 1167939 [266657280 - 267825218]
52 | default | 17959357 [267825219 - 285784575]
53 | jpeg | 703899 [285784576 - 286488474]
54 | default | 19122789 [286488475 - 305611263]
55 | jpeg | 665830 [305611264 - 306277093]
56 | default | 19156602 [306277094 - 325433695]
57 | jpeg | 34005 [325433696 - 325467700]
58 | default | 24833483 [325467701 - 350301183]
59 | jpeg | 1325770 [350301184 - 351626953]
60 | default | 78490 [351626954 - 351705443]
61 | jpeg | 29399 [351705444 - 351734842]
62 | default | 18363571 [351734843 - 370098413]
63 | jpeg | 3188843 [370098414 - 373287256]
64 | default | 57207 [373287257 - 373344463]
65 | jpeg | 56994 [373344464 - 373401457]
66 | default | 6774 [373401458 - 373408231]
67 | jpeg | 6098520 [373408232 - 379506751]
68 | default | 31480976 [379506752 - 410987727]
69 | jpeg | 327074 [410987728 - 411314801]
70 | default | 10390 [411314802 - 411325191]
71 | jpeg | 7530480 [411325192 - 418855671]
72 | default | 5218857 [418855672 - 424074528]
73 | hdr | 513 [424074529 - 424075041]
74 | dBase | 115382785 [424075042 - 539457826]
75 | default | 13312499 [539457827 - 552770325]
76 | jpeg | 717452 [552770326 - 553487777]
77 | default | 24366686 [553487778 - 577854463]
Segment data size: 1015 bytes
TN |Type name |Count |Total size
-----------------------------------------
0 |default | 35 | 388244391
1 |bintext | 3 | 29371906
2 |dBase | 1 | 115382785
3 |jpeg | 36 | 44620533
4 |hdr | 2 | 4449
10 |24b-image | 1 | 230400
-----------------------------------------
Total level 0 | 78 | 577854464
default stream(0). Total 533003531
jpeg stream(1). Total 44620533
image24 stream(5). Total 230400
Stream(0) compressed from 533003531 to 533003531 bytes
Stream(1) compressed from 44620533 to 44620533 bytes
Stream(5) compressed from 230400 to 230400 bytes
Segment data compressed from 1015 to 1015 bytes
Total 577854464 bytes compressed to 577855554 bytes.
Time 769.19 sec, used 234 MB (246040919 bytes) of memory
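The summary figures in the -s0 log above can be cross-checked with a few lines of arithmetic. This is only a sketch: the numbers are copied from the log, while the variable names and the assumption that everything other than jpeg and 24b-image ends up in stream 0 are mine, not taken from the paq8pxd source.

```python
# Per-type (count, total size) pairs copied from the summary table in the log.
type_totals = {
    "default": (35, 388244391),
    "bintext": (3, 29371906),
    "dBase": (1, 115382785),
    "jpeg": (36, 44620533),
    "hdr": (2, 4449),
    "24b-image": (1, 230400),
}
input_size = 577854464    # c.tar
archive_size = 577855554  # c.tar.paq8pxd46 at -s0 (store only)

segments = sum(count for count, _ in type_totals.values())
total = sum(size for _, size in type_totals.values())

# Assumption: all types except jpeg and 24b-image go to the default stream.
default_stream = total - type_totals["jpeg"][1] - type_totals["24b-image"][1]

# Container overhead of the stored archive (headers, file list, segment data).
overhead = archive_size - input_size

print(segments, total, default_stream, overhead)
```

The totals agree with the log: 78 segments, 577854464 bytes of input, 533003531 bytes in the default stream, and 1090 bytes of overhead when nothing is compressed.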
kaitz (2nd March 2018)
RAW formats are a nightmare, Kaido. The manufacturers believe in security by obscurity.
In EMMA, just for the parsers it's ~1.4k LOC, and then you still need specific models.
And we'd have to rewrite the whole preprocessing stage to properly support them.
You should try testing changes to cmix then, it's like watching paint dry
You can probably strip the stemming code from the model, or does paq8pxd sometimes skip WRT for text files?
KZo
Darek (4th March 2018)
Here are the scores for my testset. As with the paq8px line, improving the text model hurts some non-text files. As mpais said, K.WAD doesn't like such changes.
However, there are some quite nice gains for text files.
There is one issue: E.TIF shows a fairly big loss because part of the file is wrongly detected as text. E.TIF is an LZW-compressed image. Most compressors can squeeze it a little (2.5% at most), and paq8pxd v46 scored similarly, but after the wrong detection as text v47 took a step back. Of course, E.TIF is only one example; more files are probably affected.
Last edited by Darek; 4th March 2018 at 22:05.
kaitz (4th March 2018)
Scores for the 4 corpora for v47: small gains for the Calgary and Canterbury testsets, some regression for Maximum Compression, but a very good gain for Silesia. The mozilla file came out 100'000 bytes smaller!
Second, my scores for v47 on enwik8:
16'080'717 - enwik8 -s15
16'103'601 - enwik8.drt -s15
It looks like enwik9_1423 should come in at about 127'35x'xxx bytes.
kaitz (5th March 2018)
Scores for v47 on enwik8 and enwik9:
16'080'717 - enwik8 -s15
16'103'601 - enwik8.drt -s15
127'404'715 - enwik9_1423 -s15 - I sent the previous score to Matt privately but got no response. This record should be posted now.
@Matt - could you add this submission to LTCB page?
Paq8pxd_v47:
enwik8 - 16'080'717 -> option -s15 - encode time: 7'432s, decode time: 7'627s, memory used: 27'500MB
enwik9_1423 - 127'404'715 -> option -s15 - encode time: 75'022s, decode time: in progress, memory used: 27'500MB
System: Core i7 4900MQ at 3.8GHz, 32GB, Win7Pro 64, decompression in progress.
Source code and the 1423 resplit are attached as a 7ZIP file (139'841 bytes). Matt's zip could be a little bigger, but it's still 3rd place in LTCB.
@kaitz - do you plan to add preprocessing/recompression of images precompressed by RLE or LZW?
From my testset there are two such files: D.TGA - RLE and E.TIF - LZW.
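Before such files can be recompressed they have to be recognised. The sketch below shows one way to spot an LZW-compressed TIFF or an RLE-compressed TGA from the header alone. It relies only on the published file formats (TIFF Compression tag 259 with value 5 for LZW; TGA image types 9, 10 and 11 for RLE). The function names are hypothetical and nothing here is taken from paq8pxd; it only identifies the compression and does not decode anything.

```python
import struct

def tiff_compression(data: bytes):
    """Return the Compression tag (259) of the first IFD, or None if not TIFF."""
    if len(data) < 8:
        return None
    if data[:2] == b"II":      # little-endian TIFF
        endian = "<"
    elif data[:2] == b"MM":    # big-endian TIFF
        endian = ">"
    else:
        return None
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42 or ifd_offset + 2 > len(data):
        return None
    (count,) = struct.unpack_from(endian + "H", data, ifd_offset)
    for i in range(count):
        entry = ifd_offset + 2 + 12 * i   # each IFD entry is 12 bytes
        if entry + 12 > len(data):
            break
        tag, _type, _n = struct.unpack_from(endian + "HHI", data, entry)
        if tag == 259:
            # A single SHORT value is stored left-justified in the value field.
            (value,) = struct.unpack_from(endian + "H", data, entry + 8)
            return value
    return 1  # TIFF default when the tag is absent: no compression

def tiff_is_lzw(data: bytes) -> bool:
    return tiff_compression(data) == 5

def tga_is_rle(data: bytes) -> bool:
    # TGA has no magic number; byte 2 of the 18-byte header is the image type.
    return len(data) >= 18 and data[2] in (9, 10, 11)

# Tiny synthetic headers, just to exercise the checks.
fake_tiff = (b"II" + struct.pack("<HI", 42, 8) + struct.pack("<H", 1)
             + struct.pack("<HHIHH", 259, 3, 1, 5, 0) + struct.pack("<I", 0))
fake_tga = bytes([0, 0, 10]) + bytes(15)
print(tiff_is_lzw(fake_tiff), tga_is_rle(fake_tga))
```

Because TGA has no signature, a real detector would need further plausibility checks (sensible width, height and bit depth) before trusting the image type byte.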
I have made a small improvement to paq8pxd_v47, and this is the result:
xml -s14 253475
Anybody may test it on enwik8/enwik9 using the -s15 option, because I only have 16 GB of memory. I have attached the source code and the binary. Thank you.
@suryakandau@yahoo.co.id - could you also add image and jpg improvements from sim?
xml -s15 253'470 bytes for paq8pxd_v47_suryakandau version
xml -s15 253'855 bytes for original paq8pxd_v47 version
I'll try to compress enwik8 by -s15
@suryakandau I've tested your version on my testset, the 4 corpora and enwik8.
There are some gains for text files but also bigger losses for others, especially bigger files, so the total scores are weaker than the original v47.
enwik8 comparison:
16'080'717 - enwik8 -s15 by Paq8pxd_v47
16'103'601 - enwik8.drt -s15 by Paq8pxd_v47
16'109'814 - enwik8 -s15 by Paq8pxd_v47_suryakandau - worse score for pure file
16'101'169 - enwik8.drt -s15 by Paq8pxd_v47_suryakandau - small improvement to DRT file but still this score is worse than v47 pure.
Last edited by Darek; 19th March 2018 at 13:03.
New improvement for xml and ooffice:
xml 256053
ooffice 1382609
Maybe someone on this forum could implement LSTM in this version; I guess LSTM could improve the ratio a lot.
@suryakandau - and what about the losses on bigger files, as in your previous version? Did you check them?
from SILESIA: dickens, nci, webster
from Maximum Compression: english.dic
and enwik8
But the newer (today's) version looks better. 4 tests are still in progress, but for my testset there is a slight step forward, especially for text files (see the attached table). I suppose it will be similar for the other testsets.
16'083'438 - enwik8 -s15 by Paq8pxd_v47_suryakandau 2 - better than your previous update and very similar to the original v47!
16'099'737 - enwik8.drt -s15 by Paq8pxd_v47_suryakandau 2
What about the image model improvements? Could you also add them to paq8pxd?
Last edited by Darek; 23rd March 2018 at 12:34.
suryakandau@yahoo.co.id (23rd March 2018)
And the scores for the 4 corpora.
For the smaller tests (smaller files) there are some gains, but for Maximum Compression and Silesia, especially for bigger files, there are losses which add up to a worse score...
Some improvement on enwik8 compression using -s6:
v47 original 16.739.918
my version 16.729.127
I haven't tested with the -s15 option; any volunteer may test it with -s15 on enwik8 and enwik9. Maybe it could get third place in LTCB.
The original v47 already holds third place on LTCB; however, Matt hasn't updated the LTCB page yet.
See this post: https://encode.su/threads/1464-Paq8p...ll=1#post56169
I will test the -s15 option after the paq8px v141 tests.
scores for enwik8:
16'080'717 - enwik8 -s15 by Paq8pxd_v47
16'103'601 - enwik8.drt -s15 by Paq8pxd_v47
16'109'814 - enwik8 -s15 by Paq8pxd_v47_suryakandau
16'101'169 - enwik8.drt -s15 by Paq8pxd_v47_suryakandau
16'083'438 - enwik8 -s15 by Paq8pxd_v47_suryakandau 2
16'099'737 - enwik8.drt -s15 by Paq8pxd_v47_suryakandau 2
16'070'898 - enwik8 -s15 by Paq8pxd_v47_bwt - looks quite promising; maybe it could come in 100KB below v47 on enwik9
16'106'855 - enwik8.drt -s15 by Paq8pxd_v47_bwt
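The differences against the original v47 are easier to see as deltas. The sketch below uses only the enwik8 -s15 figures listed above and the 127'404'715 enwik9_1423 result posted earlier. The enwik9 projection is a naive assumption of mine that the enwik8 ratio carries over unchanged; it is not a measured result.

```python
baseline_enwik8 = 16080717   # enwik8 -s15, Paq8pxd_v47
baseline_enwik9 = 127404715  # enwik9_1423 -s15, Paq8pxd_v47

scores = {
    "Paq8pxd_v47_suryakandau": 16109814,
    "Paq8pxd_v47_suryakandau 2": 16083438,
    "Paq8pxd_v47_bwt": 16070898,
}

# Byte difference to the original v47 on enwik8 (negative means smaller).
deltas = {name: size - baseline_enwik8 for name, size in scores.items()}
for name, delta in deltas.items():
    print(f"{name}: {delta:+d} bytes ({100 * delta / baseline_enwik8:+.3f}%)")

# Naive projection: assume the same relative gain holds on enwik9.
bwt_ratio = scores["Paq8pxd_v47_bwt"] / baseline_enwik8
projected_saving = baseline_enwik9 - round(baseline_enwik9 * bwt_ratio)
print(f"projected enwik9 saving for _bwt: about {projected_saving} bytes")
```

Under that proportional assumption the _bwt build would save somewhat less than 100KB on enwik9, so the estimate above is plausible but only an actual -s15 run can confirm it.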
So far my improvements focus only on enwik8/9. I will include the source code with the next version.