Yes, I finally saw that the original ooffice was modified.
I have shifted the E8/E9 filters into the SMAC_compress & SMAC_decompress functions. Now, original executable files are left untouched.
All in all, the code looks better like this.
Warning: Don't use previous versions of SMAC with executable files: they contain a bug where the original files are tampered with upon compression. However, they can be restored to their original form by decompression.
I recommend you remove the download of the bad version. Modifying original files is the worst thing an archiver can do.
Yes, nice idea.
It works now. http://mattmahoney.net/dc/silesia.html
Dear readers of forum,
I apologise for my question, but I did not understand the issue. I assume that during (de)compression the source file is only read. Is that not true?
Sincerely yours,
FatBit
Indeed, the source file is only read. The fact is that I copy/pasted the E8/E9 filter from my EXE Packer, Fastpack32, and it modified the EXE/DLL source files so that specific instructions like Calls & Jumps can be compressed more easily, by changing absolute addresses to relative addresses.
With EXE Packers, on the contrary, the source file is generally modified; that's why most Packers offer the choice to create a backup.
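For illustration, here is a minimal sketch in C of a generic E8/E9 call/jump filter. It follows the usual convention of adding the instruction position to the 32-bit operand on encode and subtracting it on decode, so that repeated branches to the same target produce identical bytes; the exact convention used by SMAC/Fastpack32 (offset base, range checks, direction) is an assumption here.
Code:
/* Generic E8/E9 filter sketch: rewrite the 32-bit operand of CALL rel32 (E8)
   and JMP rel32 (E9) so that identical targets yield identical bytes.
   Details (offset base, range checks) are assumptions, not SMAC's code. */
#include <stdint.h>
#include <stddef.h>

static void e8e9_filter(uint8_t *buf, size_t len, int encode)
{
    for (size_t i = 0; i + 5 <= len; i++) {
        if (buf[i] == 0xE8 || buf[i] == 0xE9) {
            uint32_t op = (uint32_t)buf[i+1]
                        | (uint32_t)buf[i+2] << 8
                        | (uint32_t)buf[i+3] << 16
                        | (uint32_t)buf[i+4] << 24;
            /* relative <-> absolute (arithmetic wraps modulo 2^32) */
            op = encode ? op + (uint32_t)i : op - (uint32_t)i;
            buf[i+1] = (uint8_t)op;
            buf[i+2] = (uint8_t)(op >> 8);
            buf[i+3] = (uint8_t)(op >> 16);
            buf[i+4] = (uint8_t)(op >> 24);
            i += 4;                     /* skip the rewritten operand */
        }
    }
}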
Edit: when you decompress a compressed file, the original file might be overwritten by the decompressed file, if both the original & decompressed files are in the same directory. This won't be a problem as long as decompression is exact.
Last edited by Jean-Marie Barone; 28th February 2013 at 10:40.
Well, it is an easy mistake to make if the input file is open for writing or update instead of reading.
I did a lot of experiments with Indirect Context Models, and version 1.12 contains the best "recipe" I could find (so far).
As explained by Piotr a few posts ago, a hashed context is now linked to a Bit History, instead of a pair of frequencies.
I took inspiration from ZPAQ to design the content of my Bit History entry :
Being 16 bits long, an entry can store 65536 possible values.
Afterward, the Bit History value is used as an index in a Table of Frequencies, which contains a pair of bit 0/bit 1 frequencies (2*16 bits), each Order having its own table.
You can see I'm using 2 bits for the Last Bits Seen, to wit penultimate bit seen & last bit seen.
Being 7 bits long, each of the bit 0 & bit 1 counts in the Bit History can hold a maximum value of 127.
Whenever this limit is exceeded, both counts are divided by 2 and incremented.
The same rescale is applied to the entries of the Tables of Frequencies, with a bound of 65535 this time.
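To make the layout concrete, here is a rough sketch in C (SMAC itself is written in assembly); the field order inside the 16-bit entry and the smoothing used for the final prediction are my assumptions.
Code:
/* Bit History entry sketch: 2 bits of last-bits-seen + two 7-bit counts,
   packed into 16 bits; the value then indexes a per-Order frequency table. */
#include <stdint.h>

typedef struct {
    uint16_t freq0;                 /* bit 0 frequency for this history value */
    uint16_t freq1;                 /* bit 1 frequency for this history value */
} FreqPair;

static FreqPair freq_table[65536];  /* one such table per Order               */

/* assumed layout: bits 15-14 = last two bits seen,
                   bits 13-7  = count of 0s, bits 6-0 = count of 1s           */
static uint16_t update_history(uint16_t h, int bit)
{
    unsigned last = h >> 14, n0 = (h >> 7) & 127, n1 = h & 127;

    if (bit) n1++; else n0++;
    if (n0 > 127 || n1 > 127) {     /* rescale: halve both counts, then +1    */
        n0 = n0 / 2 + 1;
        n1 = n1 / 2 + 1;
    }
    last = ((last << 1) | (unsigned)bit) & 3;
    return (uint16_t)((last << 14) | (n0 << 7) | n1);
}

static double predict_p0(uint16_t h) /* probability of bit 0 (assumed smoothing) */
{
    FreqPair f = freq_table[h];
    return (f.freq0 + 1.0) / (f.freq0 + f.freq1 + 2.0);
}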
Results on enwik8 & enwik9 are 24948001 & 216016106.
I managed to gain more than 2 minutes on enwik8 compression & decompression on my Intel Core 2 Duo 2 GHz, by adding a single prefetchnta instruction.
That's funny: I had overlooked those prefetch instructions, for I read on various forums that they had become useless with "modern" processors. Well, I guess it's better to form one's own opinion...
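For reference, the same kind of non-temporal prefetch hint can be written in C with the SSE intrinsic; where exactly SMAC issues its prefetchnta, and against which table, is an assumption here.
Code:
/* Software prefetch sketch: start pulling the cache line for an upcoming
   hash-table lookup while other work (e.g. the coder update) is still
   in flight.  The table and index are hypothetical. */
#include <xmmintrin.h>
#include <stdint.h>

extern uint16_t hash_table[1 << 24];            /* hypothetical model table */

static inline void prefetch_slot(uint32_t hash)
{
    _mm_prefetch((const char *)&hash_table[hash & 0xFFFFFF], _MM_HINT_NTA);
}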
I don't know what "modern processor" means, but on my Core 2 Duo I've observed up to 2.5x read speed improvement from prefetching: http://encode.su/threads/1285-Random...ll=1#post25070
OTOH, I think that with heavily multithreaded code that is heavy on memory access, prefetching would bring less benefit - because the memory controller has some limits on the number of concurrent accesses.
New results. http://mattmahoney.net/dc/text.html#2160
Now all you need to do is add context mixing. It should not take much more time since you are already computing 3 predictions and only using 1 of them.
Edit: Silesia results. http://mattmahoney.net/dc/silesia.html
Last edited by Matt Mahoney; 12th March 2013 at 01:59.
I've been wondering, lately, if what I call Context-choosing might not be better than Context-mixing.
Let's suppose we have 3 context models: the first one has an excellent prediction, the second an average one, and the last a bad one.
Context-choosing will pick the first model, and will obtain the best result.
Context-mixing will perform a weighting+blending of all 3 models and, even if a larger weight is applied to the first model, we'll end up with an inferior result, since our excellent model's prediction would become "tainted" by its 2 associated models.
On the other hand, if our predictor fails to select the best of three models, the impact would be worse than with context-mixing I presume. Still, if it can be tuned up correctly, by taking into account output prediction errors for example, it can be interesting.
In my opinion, it's a bit as if someone offered you the choice between a glass of French wine and a glass of Californian wine, and you decided to mix them up and quaff the resulting blend.
The most skewed prediction is not necessarily the best one.
Suppose that at some time one model gives predictions like:
P(random({0, 1})) = 0.99 (ie gives a probability of 0.99 to either bit 0 or bit 1, the bit is randomly chosen).
With your context choosing you would (almost) always use that model's predictions only and that would cause a huge expansion of data.
The PAQ7-like mixer handles that situation gracefully. And not only that situation. I've started a thread recently: http://encode.su/threads/1679-Questi...ut-LPAQ1-mixer where I got a lot of answers to my questions about mixing.
E.g. a mixer is able to amplify or smooth input predictions. It also reduces weights quickly on random data, so you don't expand the data too much even if all of your input predictions are completely wrong.
The PAQ7-like mixer can do all those things because the weights don't have to sum to 1. In my experiments (with Demixer), allowing weights to take values in the (-70, 70) range improves compression compared to clamping them to a (-8, 8) range. Restricting the weights to values in the (0, 1) range and requiring them to sum to one makes compression usually much worse.
Last edited by Piotr Tarsa; 12th March 2013 at 11:02.
Thank you Piotr : I begin to see why selecting a single model can cause uncertain results.
I'm gonna put my brain in a mixer then!
The bad news is that I had to resort to FPU instructions in order to use the Stretch and Squash functions defined by Matt, hence a bad impact on speed. I'll try to experiment with some lookup tables to replace logarithmic functions and others.
The good news is that compression is noticeably enhanced : 23322767 on enwik8, and 202011435 on enwik9.
For each model, I compute a prediction such as p0 = (n0/(n0+n1))*2^32, and stretch it:
Stretch(p0) = log2(p0/(2^32-p0))
This has the effect of giving a higher weight to predictions near 0 or 1, namely predictions with high Confidence.
Then I calculate the average of the 3 stretched predictions, and use the Squash formula on it to reverse the effects of Stretch :
Squash(p) = 2^32/(1+2^-p)
I don't use any additional weights.
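As an illustration, here is a floating-point sketch in C of that stretch/average/squash combination, using the 2^32 scaling from the formulas above (SMAC does this with FPU instructions; a lookup-table variant would replace log2/exp2).
Code:
/* Stretch, average and squash three model predictions, each expressed as a
   probability of bit 0 scaled to 2^32, as in the formulas above. */
#include <math.h>

#define SCALE 4294967296.0                       /* 2^32 */

static double stretch(double p0) { return log2(p0 / (SCALE - p0)); }
static double squash (double s ) { return SCALE / (1.0 + exp2(-s)); }

static double mix3(const double p0[3])
{
    double s = (stretch(p0[0]) + stretch(p0[1]) + stretch(p0[2])) / 3.0;
    return squash(s);                  /* mixed P(bit = 0), scaled to 2^32 */
}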
Nice improvement. http://mattmahoney.net/dc/text.html#2020
Using lookup tables would probably speed it up. To keep the tables small you can probably round the table inputs to about 12 bits without hurting compression too much.
Edit: Silesia updated. http://mattmahoney.net/dc/silesia.html
Last edited by Matt Mahoney; 25th March 2013 at 19:33.
SMAC now uses adaptive weights, adjusted to minimize coding cost. The update is calculated as follows:
Wi = Wi + Learning Rate * ((1 - Symbol) - Squash(Σj Wj*Stretch(Pj0))) * Stretch(Pi0)
where Pi0 is the probability, according to the i-th model, that the next bit will be 0,
Symbol is the actual bit being coded,
Learning Rate is a constant, generally between 0.002 and 0.01, representing the adaptation rate.
Using 0.002 as the learning rate gave me good results, but maybe better results can be achieved by changing it...
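Here is a sketch of that weight update in C, again in floating point. The (1 - Symbol) target and the learning rate follow the post; normalising the error term to the [0, 1] range (the formula leaves the scaling of Squash implicit) and the choice of 3 models are my assumptions.
Code:
/* PAQ-style adaptive weight update: Wi += LR * error * Stretch(Pi0). */
#include <math.h>

#define SCALE         4294967296.0   /* 2^32, as in the Stretch/Squash formulas */
#define NMODELS       3
#define LEARNING_RATE 0.002

static double stretch(double p0) { return log2(p0 / (SCALE - p0)); }
static double squash (double s ) { return SCALE / (1.0 + exp2(-s)); }

/* p0[i]: P(bit = 0) from model i, scaled to 2^32; bit: the actual coded bit */
static double mix_and_update(const double p0[NMODELS], double w[NMODELS], int bit)
{
    double st[NMODELS], dot = 0.0;
    for (int i = 0; i < NMODELS; i++) {
        st[i] = stretch(p0[i]);
        dot  += w[i] * st[i];
    }
    double p   = squash(dot);                 /* mixed P(bit = 0), scaled          */
    double err = (1.0 - bit) - p / SCALE;     /* target minus prediction, in [0,1] */
    for (int i = 0; i < NMODELS; i++)
        w[i] += LEARNING_RATE * err * st[i];
    return p;
}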
I've been able to replace the Log2 instruction by a Lookup table with no significant impact on compression, reducing the mantissa from 23 to 14 bits.
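For what it's worth, a table-based log2 along those lines can be sketched in C as below: the exponent is taken from the IEEE-754 representation and only the top 14 mantissa bits index the table. Table size and indexing details are assumptions.
Code:
/* Table-based log2 for single precision: log2(x) = exponent + log2(1.m),
   with the mantissa truncated to its top 14 bits (x must be a positive
   normal number). */
#include <math.h>
#include <stdint.h>
#include <string.h>

static float log2_tab[1 << 14];

static void init_log2_tab(void)
{
    for (int i = 0; i < (1 << 14); i++)
        log2_tab[i] = (float)log2(1.0 + i / 16384.0);
}

static float fast_log2(float x)
{
    uint32_t u;
    memcpy(&u, &x, sizeof u);
    int e = (int)((u >> 23) & 0xFF) - 127;   /* unbiased exponent              */
    int m = (u >> 9) & 0x3FFF;               /* top 14 of the 23 mantissa bits */
    return (float)e + log2_tab[m];
}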
Results on enwik8 & enwik9 are 22675896 & 193797222.
Last edited by Jean-Marie Barone; 21st April 2013 at 12:02. Reason: typo in formula
Do you have one mixer or many mixers? Usually having a small set of mixers helps to improve compression but you have to carefully invent a context for choosing the mixer.
Also you could do mixing of orders 6-4-3-2-1 with orders 2 and 1 being directly addressed (ie without hashing the context). That should improve compression on binary and mixed type files I think.
I use a single mixer working with Orders-6-4-3 if the file size is greater than 5 MB. With smaller files I use Orders-4-3-2, and 3-2-1 for files smaller than 2 MB. I'm a bit reluctant to work with more than 3 orders, because I'm afraid it will hurt the speed, but I might give it a try if it has a good impact as you suggest. Also, I noticed that compression on small files is not that good, so it could be helpful indeed. Using several mixers looks a bit daunting for the moment. Maybe when I become more experienced!
I have followed the advice of Piotr and have integrated Orders-2 & 1, directly addressed in memory since hashing becomes unnecessary with such small contexts. Also, I made them non-stationary rather than ICM to gain a bit of speed.
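A rough sketch of what direct addressing of the low orders can look like in C (the counter format and table sizes here are assumptions, not SMAC's actual layout):
Code:
/* Directly addressed order-1 and order-2 tables: the previous byte(s) and the
   bits of the current byte seen so far index the counters with no hashing. */
#include <stdint.h>

typedef struct { uint8_t n0, n1; } Counter;   /* hypothetical 2-byte counter pair */

static Counter order1[256 * 256];             /* 128 KB: [prev byte][partial byte]    */
static Counter order2[256 * 256 * 256];       /* 32 MB: [2 prev bytes][partial byte]  */

/* ctx: 1..255, the bits of the current byte seen so far, prefixed with a
   leading 1 (the usual bitwise-coding convention) */
static Counter *order1_slot(uint8_t c1, unsigned ctx)
{
    return &order1[((unsigned)c1 << 8) | ctx];
}

static Counter *order2_slot(uint8_t c1, uint8_t c2, unsigned ctx)
{
    return &order2[((unsigned)c2 << 16) | ((unsigned)c1 << 8) | ctx];
}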
The outcome with enwik8 & enwik9 is not spectacular (22303381 & 191064676), but more interesting with the Silesia Corpus where executable files benefit from this update, as predicted by Piotr.
I got rid of the fdiv instruction in the Stretch function, by calculating Log2(p0)-Log2(p1) which turned out to be faster than previous Log2(p0/p1).
Otherwise, I am now considering adding a Preprocessor to eliminate long repetitive strings (>7 chars).
I have some ideas for an algorithm roughly based on the Star-transform which could support all kinds of files, not only text ones. I have written the flowchart; this is a first draft, so there is room for improvement.
Updated LTCB. http://mattmahoney.net/dc/text.html#1910
Edit: Silesia benchmark. http://mattmahoney.net/dc/silesia.html
Code:
   Total  dicke  mozil    mr   nci  ooff  osdb  reym  samba   sao  webst  x-ray  xml  Compressor
49701757   2351  16509  2461  2219  2334  2725  1124   3902  4898   6583   4086  505  smac 1.1.5
51522104   2370  17555  2474  2236  2501  2756  1151   4052  5035   6667   4190  528  smac 1.14
53291856   2405  18199  2514  2350  2627  2778  1167   4235  5100   6852   4508  551  smac 1.13
56236960   2554  19220  2795  2588  2718  2937  1231   4559  5177   7270   4595  588  smac 1.12a
57465304   2624  19395  2826  2603  2762  3037  1231   4706  5361   7452   4810  653  smac 1.11b
62675226   2689  21648  2864  2521  3703  3275  1305   5132  5612   7557   5642  722  smac 1.10
64661799   2630  22832  2749  2654  3736  3199  1288   5753  5561   7951   5551  751  smac 1.9
66227778   2806  22410  2753  3271  3736  3187  1400   6108  5391   8956   5326  878  smac 1.8
Last edited by Matt Mahoney; 21st May 2013 at 17:49.
The Zero-Transform doesn't look promising to me, but you can try it and post the results. I think REP/SREP/et similia are tested well enough and they provide reasonably good results.
As to possible further ideas to try, I suggest many mixers (selectable by context) and/or two-level mixing (ie in the first pass output more than one probability and in the second pass combine them using another mixer).
High-order mixing turned out to be pretty hard in my experiments. Coupled with the low efficiency of hash tables at high orders, it doesn't look attractive to explore. Anyway, you could try some creative variations of hash tables that have the potential to squeeze more useful statistics into the same memory area. Shelwien has written about his ideas on this forum, but he seems unreachable at the moment.
PS:
The most useful context for selecting mixers should be the index of the highest available context - ie if you have models for orders 1, 2, 3, 4 and 6, then you can have 5 mixers and on every bit choose the appropriate one. Maybe you're doing something like that already, but I have an impression that you don't (you said you were afraid of multiple mixers).
Last edited by Piotr Tarsa; 22nd May 2013 at 14:54.
I'm still pondering whether I should start coding my ZT or not. I'm afraid I'll have to struggle with collisions...
I have tested REP, WBPE, XWRT, ASMXML and I had some good results with enwik8 (20946710 with XWRT), but the shortcoming is that they only work with pure text files, while I'm looking to get rid of long patterns regardless of the file type.
Maybe I'll give it a try though, to do something new. I'll give you the results, be they good or bad...
Quote: "As to possible further ideas to try I suggest many mixers (selectable by context) and/or two-level mixing (ie in first pass output more than one probability and in second pass combine them using another mixer)."
It makes me think of PAQ9A, where Matt used "chains of 2-input mixers rather than a single mixer combining many predictions".
This looks enticing indeed.
Last edited by Jean-Marie Barone; 22nd May 2013 at 15:35.
You don't need to have chains of mixers to have multiple mixers.
What I said is that, assuming you currently have one mixer with 5 inputs (from models of orders 1, 2, 3, 4 and 6), you can add four more mixers:
- first with only one input (ie order 1) used when only prediction of order 1 is available - this could probably be omitted,
- second with two inputs - order 1 and order 2 when order 2 prediction is available but no higher order predictions,
- same for three inputs (order 1, order 2 and order 3) and four inputs (order 1, order 2, order 3, order 4),
- the current 5-input mixer will stay of course and will be used when the order 6 prediction is available.
Quote: "the shortcoming is that they only work with pure text files"
REP/SREP work with any type of file. IIRC they are just variants of LZ77 designed for long and distant matches. XWRT and other XML-oriented transformers do a lot of transformations; they are heavily dependent on the structure of XML. LPAQ has a simple word model, which is like a normal order-n model except that n is not fixed but is min(some-fixed-limit, length-of-recent-run-of-alphabetic-chars).
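A small sketch of such a word-model context in C (the hash constants and case folding are arbitrary choices, not LPAQ's exact code):
Code:
/* Word model context: hash the current run of alphabetic characters and
   reset whenever a non-letter is seen. */
#include <stdint.h>
#include <ctype.h>

static uint32_t word_hash = 0;

static uint32_t update_word_context(int byte)    /* byte in 0..255 */
{
    if (isalpha(byte))
        word_hash = word_hash * 2654435761u + (uint32_t)tolower(byte) + 1;
    else
        word_hash = 0;                           /* word ended: empty context */
    return word_hash;
}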
If you want your codec to work efficiently on many types of files then you need to develop filters for many types of files. That would make the codec prohibitively complex, assuming it will be still coded entirely in assembly.
Ok. So when you have collected predictions for, say, 5-input mixer, 4-input mixer, 3-input mixer & 2-input mixer, do you select the best prediction, or do you mix them? In the latter case, do you also apply a weight to each input mixer?
Quote: "If you want your codec to work efficiently on many types of files then you need to develop filters for many types of files. That would make the codec prohibitively complex, assuming it will be still coded entirely in assembly."
I could preprocess the file according to its content, applying a dedicated filter for JPG, WAV and so on, but that's a hell of a job, I agree.
For the moment, I just want to get rid of long repetitive patterns : since SMAC is limited to a maximum Order-6, it would be helpful.
For example, in an EXE file, there are a lot of 00 bytes in the header which could be reduced, or in a BMP file, you might find the same long pattern repeated.
paq9a uses an ICM-ISSE chain like zpaq -method 4, plus sparse and word models. It uses an order 12 LZP preprocessor to remove long matches first, but this turns out to hurt compression quite a bit and is mainly a speed optimization. You would probably get better compression using a match model, which paq9a lacks because the lzp preprocessor makes it unnecessary. The most useful preprocessor is probably e8e9 for .exe and .dll files.
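For illustration, here is a rough sketch in C of the kind of match model Matt mentions (not paq9a's actual code): remember the last position at which each high-order context occurred, and if the current context matches, predict the next byte from the byte that followed last time.
Code:
/* Match model sketch: match_ptr remembers where each context hash was last
   seen; collisions and stale entries simply produce a wrong guess. */
#include <stdint.h>

#define MATCH_BITS 22                            /* 4M-entry table (assumed)  */

static uint32_t match_ptr[1u << MATCH_BITS];     /* last position after each context */
static uint32_t match_pos;                       /* current match position, 0 = none */

/* Call once per byte, before coding buf[pos]; ctx_hash hashes the last few
   bytes.  Returns the predicted next byte, or -1 if no match is available;
   the byte prediction is then turned into bit predictions during coding. */
static int match_predict(const uint8_t *buf, uint32_t pos, uint32_t ctx_hash)
{
    uint32_t h = ctx_hash & ((1u << MATCH_BITS) - 1);

    if (match_pos && pos && buf[match_pos] == buf[pos - 1])
        match_pos++;                             /* previous guess was right: extend  */
    else
        match_pos = match_ptr[h];                /* otherwise jump to last occurrence */

    match_ptr[h] = pos;                          /* remember the current position     */
    return match_pos ? buf[match_pos] : -1;
}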
Quote: "So when you have collected predictions for, say, 5-input mixer, 4-input mixer, 3-input mixer & 2-input mixer, do you select the best prediction, or do you mix them? In the latter case, do you also apply a weight to each input mixer?"
It seems that you misunderstood.
First, some terminology:
I say that a prediction is not available/not present/etc. when a particular context has either been seen for the first time ever or was evicted from the hash table (due to collisions) since the last time it was seen.
A coding step is all actions during coding of one bit. Complete encoding process consists of initialization, a coding step for every input bit and finalization (encoder flush).
Encoding of each input bit requires the following steps:
1. collect predictions from models, there are up to 5 predictions,
2. mix them to get prediction,
3. encode/ decode bit using that prediction,
4. update mixer,
5. update the models.
To make the algorithm behave correctly, the mixer in step 2 and step 4 needs to be the same mixer within each coding step. However, you can use different mixers during different coding steps; for example, you can have two mixers and use the first one on bits at odd positions and the second one on bits at even positions.
What I propose (or better: what I have seen in LPAQ, IIRC) is to have 5 mixers. Use the first one if in step 1 you collected only 1 prediction. Use the second one if in step 1 you collected 2 predictions. And so on; finally, use the fifth one if in step 1 you collected 5 predictions (ie all models had available predictions).
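A sketch of one such coding step in C, with the mixer selected by the number of available predictions. The types and helper functions (Mixer, Model, code_bit, ...) are placeholders for whatever the coder actually provides, not SMAC's or LPAQ's API.
Code:
/* One coding step: collect, mix (with the mixer chosen by prediction count),
   code the bit, then update the same mixer and the models. */
#define NMODELS 5

typedef struct Mixer Mixer;
typedef struct Model Model;

extern int    model_predict(Model *m, double *p0_out);  /* 1 if a prediction is available */
extern void   model_update (Model *m, int bit);
extern double mixer_mix    (Mixer *mx, const double *p0, int n);
extern void   mixer_update (Mixer *mx, const double *p0, int n, int bit);
extern int    code_bit     (double p0);                 /* encode or decode one bit */

static int coding_step(Model *models[NMODELS], Mixer *mixers[NMODELS])
{
    double p0[NMODELS];
    int n = 0;

    /* 1. collect predictions from the models that have one available
          (the lowest orders are always present, so n >= 1) */
    for (int i = 0; i < NMODELS; i++)
        if (model_predict(models[i], &p0[n]))
            n++;

    Mixer *mx  = mixers[n - 1];          /* mixer dedicated to this count   */
    double p   = mixer_mix(mx, p0, n);   /* 2. mix                          */
    int    bit = code_bit(p);            /* 3. encode/decode the bit        */

    mixer_update(mx, p0, n, bit);        /* 4. update the same mixer        */
    for (int i = 0; i < NMODELS; i++)    /* 5. update the models            */
        model_update(models[i], bit);

    return bit;
}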
Two level mixing is different from the above concept, but is not incompatible.
Quote: "For the moment, I just want to get rid of long repetitive patterns: since SMAC is limited to a maximum Order-6, it would be helpful. For example, in an EXE file, there are a lot of 00 bytes in the header which could be reduced, or in a BMP file, you might find the same long pattern repeated."
That would be a speed optimization. IIRC in LPAQ there's a dedicated long match model, and if there is a very long consecutive match then all models except the match model are disabled until a mismatch is encountered.
Both of those techniques are used in various paq8 versions (but I think not in lpaq, or possibly in zpaq). One problem with disabling models for long matches is that those models will not see most of the match and will then predict the next byte badly, because they don't see the immediate context. I only used it for really long matches like 400 bytes or more. As for using the context match length as mixer context, it does help, but I'm not sure if it works any better than just using an order 0 or 1 context to select the mixer weight vector, as is normally done in zpaq. It might.