I'm looking forward to the how-it-works bits.
For the whole compressor, in fact.
I find the Etincelle results really impressive and am fascinated by what might be going on under the hood.
After programming the "incompressible segment detection & skipping" algorithm, I was asked if Etincelle would still be able to detect two identical segments which are incompressible, such as, for example, two identical zip files.
I said that yes, it should, having the feeling I was keeping enough information to find such occurrences. That was easy to say. But better to find out on real data.
I created a simple tar file, which consists of enwik8 compressed with 7z, twice. This is perfect repetition, at long range (about 30MB away). Therefore, the ideal compression ratio should be 50%.
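For reference, here is a minimal sketch of how such a test file can be built: it simply writes the same compressed file twice, back to back (a real tar adds small member headers, which barely changes the numbers). The program and file names are just examples.
Code:
    /* double.c : write infile twice into outfile, creating one perfect long-range repetition */
    #include <stdio.h>

    int main(int argc, char** argv)
    {
        if (argc != 3) { fprintf(stderr, "usage: double infile outfile\n"); return 1; }
        FILE* out = fopen(argv[2], "wb");
        if (!out) return 1;
        char buf[65536];
        for (int pass = 0; pass < 2; pass++) {   /* two identical copies, ~30MB apart for a 7z of enwik8 */
            FILE* in = fopen(argv[1], "rb");
            if (!in) { fclose(out); return 1; }
            size_t n;
            while ((n = fread(buf, 1, sizeof(buf), in)) > 0)
                fwrite(buf, 1, n, out);
            fclose(in);
        }
        fclose(out);
        return 0;
    }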
Let's test with the "old" version :
beta2 : 100.00% - 49 MB/s - 186 MB/s
That's a terrible result: the original Etincelle does not find any correlation!
There is an explanation though: these repetitions are very distant, in the range of 30MB away.
As a consequence, the table gets filled with a lot of useless "noise pointers", to the point of forgetting what happened 30MB earlier.
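To make the "noise pointers" explanation concrete, here is a toy sketch (not Etincelle's actual table layout, which is not public): with a single stored position per hash slot, every insertion evicts the previous occupant, so after ~30MB of incompressible input nothing from the first copy remains to match against.
Code:
    #include <string.h>

    #define HASH_LOG  17
    #define HASH_SIZE (1 << HASH_LOG)

    static unsigned int table[HASH_SIZE];     /* one stored position per slot */

    static unsigned int hash4(const unsigned char* p)
    {
        unsigned int v;
        memcpy(&v, p, 4);                     /* hash the next 4 input bytes */
        return (v * 2654435761u) >> (32 - HASH_LOG);
    }

    /* Every insertion overwrites whatever was there. Incompressible data inserts
       pseudo-random positions at a high rate, so by the time the second copy of
       the zip file arrives, the positions of the first copy have all been evicted. */
    static void insert_position(const unsigned char* base, unsigned int pos)
    {
        table[hash4(base + pos)] = pos;
    }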
So now let's test the brand new beta 3 with "incompressible segment detection" :
beta3 : 70.86% - 480 MB/s - 1150 MB/s
That's much better. This time, some correlation was found. In any case, it proves that beta 3 does improve the chances of finding two occurrences of the same zip file.
But why not 50% ?
This can be explained too: Etincelle has a limit on the size of the repetition sequence it can handle. Therefore, after reaching the maximum repetition size, it has to find a new match again. However, it may not find one immediately. Fortunately, some minimal book-keeping ensures that it gets back in sync some time later, but in between it loses quite a few opportunities to shrink the file.
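To illustrate the limitation, and what "long-repetition support" has to add, here is a toy sketch; the constant, the token writer and the variable names are made up, not Etincelle's.
Code:
    #include <stdio.h>
    #include <stddef.h>

    #define MAX_MATCH 65535u                  /* assumed per-token length limit */

    static size_t resume_offset;              /* where the last long match left off */

    static void emit_match(size_t offset, size_t len)   /* stub for a real token writer */
    {
        printf("match: offset=%zu length=%zu\n", offset, len);
    }

    /* Without this, the encoder emits one capped match, then has to rediscover the
       same repetition from scratch, and may stay out of sync for a while. With it,
       the repetition is emitted as a chain of capped tokens, and 'resume_offset'
       lets the next search try the same source position first. */
    static void emit_long_match(size_t offset, size_t total_len)
    {
        while (total_len > MAX_MATCH) {
            emit_match(offset, MAX_MATCH);
            total_len -= MAX_MATCH;
        }
        emit_match(offset, total_len);
        resume_offset = offset;
    }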
Hence comes beta 4, a modification of Etincelle with Long-Repetition support. It should really help in this situation.
So let's put it to the test :
beta4 : 50.00% - 570 MB/s - 1300 MB/s
This time, it seems good. Perfect long range repetition.
Beta 4 seems to fulfill its objective, but let's check what happens in a more complex situation. I created a new tar with several already-compressed files mixed with normal files. The first occurrences of the compressed files are intentionally grouped together at the beginning, to try to trick the "detection & skipping" algorithm. They may repeat later, at a large distance.
The resulting file is 126 MB long, and its best theoretical compressed size is the sum of the compressed parts counted only once, which is 62 258 KB.
Let's start the comparison:
beta2 : 102 546 KB - 60 MB/s - 169 MB/s
beta3 : 69 922 KB - 160 MB/s - 350 MB/s
beta4 : 62 768 KB - 165 MB/s - 355 MB/s
Within close range of the theoretical minimum. Seems good enough.
What about real files now ?
Long repetitions are not common in normal situations.
One important exception is mailboxes, which tend to be filled with several instances of identical attached files. These files are more often than not zipped, so the new algorithm works wonders at detecting them.
Apart from this situation, expect less impressive results. Having tested several large files, beta4 usually offers small gains, often only a few KB.
Anyway, it can only improve the compression ratio compared to beta3. So it seems purely beneficial.
beta4 :
http://sd-1.archive-host.com/membres...elle-beta4.zip
Regards
Last edited by Cyan; 10th April 2010 at 13:53.
b4 is slower than b3, and in a bunch of tests I just ran it gives the same size output most of the time, and saves a few bytes on other files. Not sure though how commonly we'd expect to see files with many repeating incompressible parts?
So you could have this mode enabled by default, and also have a switch to turn off b4's feature, if that is even possible.
one interesting idea is to add a rep-like detector of large repetitions
I disagree a bit (but not really much). I have several large files from real-life data that have long-distance matches of 512 bytes or above.
It's mostly re-released game DVD ISOs, where the content of several CDs is put onto one DVD.
To avoid too much CD swapping, in the first release several files are present on multiple CDs, and in the DVD release they end up in the same ISO.
You can see my prior post about SREP reducing a 6 GB archive to 3 GB.
I've got at least a handful of files where a dictionary above 1 GB still improves compression notably.
Thanks for testing, Eugene.
Looking at your figures, it seems that beta2 was a bad design choice after all. I may have to revert that one for the next release.
Indeed, Intrinsic, this small speed hit is a mystery to me. It is also visible in Eugene's results. Something weird is happening, probably on the compiler side.
While I was gladly surprised to see that the "incompressible segment detection" was unobtrusive, keeping in mind that it costs almost one additional test per loop,
I'm now more than amazed to witness a small cost, in the range of 2%, for the Large Repetition code, even though it is almost never triggered.
I mean, the test itself probably happens once every 100K loops, and the action is so simple!
There is something abnormal happening.
That's why I'm suspecting compiler optimisations.
The only explanation I can come up with is that the additional code, while unobtrusive in itself, may indirectly impact the optimizations done in the main loop.
Alas, I have no talent for reading the ASM code produced by the compiler directly. So I'm just trying different code to see if I can avoid this side-effect.
Indeed Bulat, I agree:
> one interesting idea is to add a rep-like detector of large repetitions
I therefore tested a large GCC directory with many large repetitions (many *.exe are almost identical). SREP can reduce it from 550MB to 275MB, half the size.
Then, applying Etincelle, the gains were less impressive: directly compressing the original leads to 101MB, while compressing SREP's output results in 99MB.
It seems Etincelle already does a pretty good job at finding those large repetitions.
However, it is limited by its dictionary size, and its efficiency drops with larger sizes, while SREP can go much, much farther.
So in fact they are only comparable for files of a few hundred MB. Beyond that, only SREP stands.
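For readers who have not looked at this class of tools, here is a crude sketch of the general principle (hash large fixed-size blocks and keep only their digests, so gigabytes of history fit in a small table). This is the concept only, not SREP's actual algorithm.
Code:
    #include <stdint.h>
    #include <stddef.h>

    #define BLOCK_SIZE 4096                   /* coarse granularity: one digest per 4KB block */
    #define TABLE_BITS 20                     /* 1M entries, ~16MB of RAM, covers ~4GB of history */
    #define TABLE_SIZE (1u << TABLE_BITS)

    typedef struct { uint64_t digest; uint64_t pos; } BlockRef;
    static BlockRef table[TABLE_SIZE];

    static uint64_t block_digest(const unsigned char* p)  /* FNV-1a: good enough for a sketch */
    {
        uint64_t h = 1469598103934665603ULL;
        for (size_t i = 0; i < BLOCK_SIZE; i++) { h ^= p[i]; h *= 1099511628211ULL; }
        return h;
    }

    /* Returns the position of a previously seen identical-looking block, or -1.
       A real implementation would verify the bytes before trusting the digest,
       then extend the match forward and backward. */
    static int64_t find_block_repeat(const unsigned char* base, uint64_t pos)
    {
        uint64_t d = block_digest(base + pos);
        BlockRef* slot = &table[(size_t)(d & (TABLE_SIZE - 1))];
        int64_t prev = (slot->digest == d && slot->pos != pos) ? (int64_t)slot->pos : -1;
        slot->digest = d;
        slot->pos = pos;
        return prev;
    }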
Last edited by Cyan; 10th April 2010 at 18:06.
Yea, I also know a real-world case where something is stored several times.
DVD audio / video discs often contain a lot of stuff that's not normally accessible, and I've seen quite a few cases when there were 2 copies of some audio stream. I've seen some repeated rubbish too (IIRC 7 copies of a 30 sec. sample of Beethoven on a Bach DVDA). And a case where all 24/96 tracks were written twice.
Hello
You'll find the latest version of Etincelle, named RC1, at this address:
http://phantasie.tonempire.net/pc-co...s-t102.htm#160
We are close to V1 full release now.
Changes are minor.
Due to a modification of element sizes, Etincelle compresses most filetypes a little better, except text.
The default buffer size is now 128MB instead of 1GB. You can still set it manually using the -m switch. The maximum is 3300MB.
And the benchmark mode now also accepts large files as input.
Please don't hesitate to report any side effect you may witness, as this version is likely to become the final one.
I also used the weekend to update the "Fast compression graphic benchmark", with the binary source now based on VM content.
http://phantasie.tonempire.net/pc-co...rk-t96.htm#150
Regards
Last edited by Cyan; 12th April 2010 at 19:29.
I do not know, but it seems to me that Etincelle is not more powerful (= better compression ratio) than standard zip compression. It is pretty damn fast, I agree, but less powerful.
xul.dll (from Firefox 3.6.3 directory): 11 676 632 bytes
etincelle (default): 6 681 170 bytes
7zip's ZIP (ultra, deflate mode): 5 975 429 bytes
7zip's ZIP (fastest, deflate mode): 6 248 074 bytes
PKZIP 2.50 win32 CLI (default): 6 150 533 bytes
PKZIP 2.50 win32 CLI (super fast): 6 778 017 bytes
I tested other files too (bitmaps, text, etc.), but conclusion is the same.
Indeed, Mindi, you are right,
the mention "for large files" is missing.
And this mention matters. Etincelle needs some "size" to ramp up, probably more than 10MB, while zip is at full strength above 32KB.
Etincelle is designed more for compressing virtual hard disks, ISOs, large mailboxes and other container types, whose size is frequently at or above a few hundred MB, and for which fast compression makes a noticeable productivity difference.
Compressing individual files is probably fast enough with Zip, so being even faster didn't look useful. Now I can be wrong, and this forum is here to listen to your point of view.
For Etincelle to be competitive in this area, there would need to be a specific "ramp up sequence", removing inefficiencies with smaller bets on smaller quantities. Quite possible, but as I was not actively looking for this target, I accepted the inefficiency for the sake of simplicity. The code is very compact, with a single loop that is efficient on most of the compressed file, though granted not at its beginning.
Note that, even on large files, "Etincelle compresses better than Zip" is more a trend than a definitive statement. First, the difference is not so large; zip -9 is often pretty close to Etincelle. Second, the two compressors use very different algorithms, so the assumptions of one can prove better from time to time. You may find some counter-examples to this trend here and there; it is just unlikely beyond 100MB, not impossible.
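For reference, the 32KB figure mentioned above is a hard limit of the deflate format itself, not a tuning choice. A minimal zlib setup (standard zlib API; the wrapper function is just for illustration) shows the ceiling explicitly:
Code:
    #include <zlib.h>
    #include <string.h>

    /* Deflate can never reference data more than 32KB back: 15 is the maximum
       windowBits the format allows. Any repetition farther away is invisible to
       zip/gzip regardless of the compression level, while Etincelle's buffer
       (128MB by default) can still reach it. */
    int init_deflate_max_window(z_stream* strm)
    {
        memset(strm, 0, sizeof(*strm));
        return deflateInit2(strm,
                            Z_BEST_COMPRESSION,   /* roughly "zip -9" */
                            Z_DEFLATED,
                            15,                   /* windowBits: 2^15 = 32KB, the ceiling */
                            8,                    /* memLevel */
                            Z_DEFAULT_STRATEGY);
    }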
Last edited by Cyan; 14th April 2010 at 00:32.
You're both right. Imo, Etincelle benefits from its large dictionary. So, if there are correlations in the data which zip cannot exploit because of its small dictionary, Etincelle easily outperforms zip.
The same holds true for SX. SX has stronger compression than Etincelle, but it loses momentum because its dictionary is limited to 16MB, iirc.
For the given compression speed etincelle does a pretty neat job. Great job!
Thanks Christian !
Slug was instrumental in proving that such a speed is possible, so in a way Etincelle owes this to you.
In my opinion, SlugX always compresses better. At least that's what I noted on the graphic chart:
http://phantasie.tonempire.net/pc-co...rk-t96.htm#148
I'm pretty short of counter examples, so I'm just happy when Etincelle is not too far behind...
Btw, I conjectured that one of the differences between our compressors is the use of some kind of range coder within SlugX, providing better entropy compression than Huffman. This could explain the more noticeable difference in decompression speed. Now, this is just conjecture...
Regards
Last edited by Cyan; 14th April 2010 at 01:42.
Hi Cyan!
SX, as you know it, does not use a range coder, but higher-order Huffman coding (than Etincelle). Undisclosed versions of SX are quite a bit better than SX but I never released them - you probably know why.
But you're right, SX's advantage is due to better entropy coding. I think so, too (SX evaluates only one match per round; nonetheless, SX also uses a better scheme than ROLZ). As a side note, current versions of SX's entropy coder surpass Slug's (1.x) original speed while providing better compression. Just keep your mind open...
Last edited by Christian; 14th April 2010 at 09:27.
Indeed, order-1 Huffman is bound to improve the compression ratio.
I'm just amazed you did that at such a speed, as it's more complex than it sounds. But I guess it could also explain the decompression speed.
As a quick question: I guess you are using this method for literals only?
I've never imagined that this could be useful for match lengths, for example, but maybe I'm wrong.
I'm asking because literals are not so numerous after match-finding, limiting the scope for size reduction.
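To put a number on that intuition, here is a small measurement helper (a sketch, unrelated to SX's or Etincelle's real coders; all names are made up). It compares the order-0 entropy of a literal stream with its order-1 entropy, using the previous literal as context; the gap roughly bounds what order-1 coding of literals could save over plain Huffman.
Code:
    #include <math.h>
    #include <stdio.h>
    #include <string.h>

    static double entropy_bits(const unsigned long* freq, unsigned long total)
    {
        double bits = 0.0;
        for (int i = 0; i < 256; i++)
            if (freq[i])
                bits -= (double)freq[i] * log2((double)freq[i] / (double)total);
        return bits;
    }

    /* Compare order-0 and order-1 (context = previous byte) entropy of a literal stream. */
    void compare_literal_orders(const unsigned char* lit, size_t n)
    {
        static unsigned long o0[256], o1[256][256], o1tot[256];
        memset(o0, 0, sizeof(o0)); memset(o1, 0, sizeof(o1)); memset(o1tot, 0, sizeof(o1tot));

        for (size_t i = 0; i < n; i++) {
            unsigned ctx = (i == 0) ? 0 : lit[i - 1];
            o0[lit[i]]++;
            o1[ctx][lit[i]]++;
            o1tot[ctx]++;
        }

        double b0 = entropy_bits(o0, (unsigned long)n);
        double b1 = 0.0;
        for (int c = 0; c < 256; c++)
            if (o1tot[c]) b1 += entropy_bits(o1[c], o1tot[c]);

        if (b0 > 0.0)
            printf("order-0: %.0f bits, order-1: %.0f bits (%.1f%% smaller)\n",
                   b0, b1, 100.0 * (1.0 - b1 / b0));
    }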
Regards
Last edited by Cyan; 15th April 2010 at 13:01.
Hi
I'm currently considering open-sourcing Etincelle, most probably after the v1 release.
Apart from a few questions on license selection, I'm currently wondering what usage could actually benefit from such an algorithm, as the main point of open-sourcing is imho to be useful to other applications (archivers being one of them).
Etincelle is just a compression algorithm. It is fast, true, but it also consumes a lot of memory (although this can be tuned by parameters). This makes it less interesting for communication protocols, which are the prime consumers of fast algorithms. I'm not sure it can be interesting for desktop usage either, as most of the time users will favor a higher compression ratio and are ready to sacrifice time for it. Or so I guess.
So I can only imagine regular large backups as a potential candidate, since backup is typically a dull task you want to finish fast, to get back to normal activity, or even to continue working in parallel. Moreover, backup typically implies large data sizes, which Etincelle is pretty good at. VM backup seems an especially good example. Databases, mailboxes, ISOs, HDD images, etc. could maybe also be considered.
These are just rough ideas, and the point of this call is to get your advice on what Etincelle could be useful for. It will give some good hints for providing useful interfaces, and for selecting the best way to open the source code.
Regards
Last edited by Cyan; 15th April 2010 at 13:56.
Closed source micro-apps die a long cold quiet death of inactivity.
There are three more good reasons for open-sourcing the code:
- that it survives the next time you accidentally delete it from your hard disk. This happens! I've lost lots of little bits of code I've since missed, simply because I change computers occasionally
- that others might learn from it and discuss it
- that others might improve it
> get your advice on what Etincelle could be useful for
http://quicklz.com/testimonials.html
> So I can only imagine regular large backups as a potential candidate
Backup is weird. First, you really need a full archiver framework
for it (to check for unmodified files, duplicate files, etc).
Btw, there are usually many files in large amounts of data, even
when it seems that there's only one
(we can parse files and extract structures).
Also, backup is commonly a "write-only" application, and CMs
frequently show a better compression speed/ratio trade-off than LZ.
As to networking, LZ only shines in "dumb" solutions, where
you fetch the data, then extract it when it's complete
(same with compression).
But CM is usually much more streamable, and if it can be
overlapped with networking - again, LZ is worse, at least if it's
not a gigabit pipe.
But anyway, I'm sure that you'd be able to get it integrated into FreeArc
Indeed, QuickLZ is a very good choice for communication protocols, as it is not only fast but also very light.
Now, it is interesting to see that several applications compress very large sets of data to improve bandwidth usage during transfers, which proves there is an interest in on-the-fly compression of large datasets (in contrast with packet contents or php/java scripts).
> Closed source micro-apps die a long cold quiet death of inactivity
There is probably zero interest in Etincelle as a console-only application. It's useful for getting an idea of the performance, but not really usable.
The objective is mainly to provide the algorithm with an API interface, to allow smooth integration into any program, whether it is an archiver or a more specialized application.
And indeed, providing a clean and easy-to-integrate interface is probably the main task after opening the source code. I guess this can be improved upon, and that's where feedback is welcome.
Actually API choice frequently strongly affects the algorithms.
For example, I recently prefer codec APIs which can be used like this:
Code:
    #include <stdio.h>
    typedef unsigned char byte;
    typedef unsigned int  uint;

    // init/process/addinp/addout and the codec-internal 'outptr' are the
    // codec-side API being illustrated here
    void processfile( FILE* f, FILE* g,
                      byte* inpbuf, uint inpbufsize,
                      byte* outbuf, uint outbufsize )
    {
      init();
      addout( outbuf, outbufsize );
      while( 1 ) {
        int r = process();
        if( r==0 ) break;
        if( r==1 ) {
          uint l = fread( inpbuf, 1, inpbufsize, f );
          if( l==0 ) break; // get/put don't support repeated failures
          addinp( inpbuf, l );
        } else if( r==2 ) {
          fwrite( outbuf, 1, outbufsize, g );
          addout( outbuf, outbufsize );
        }
      }
      fwrite( outbuf, 1, outptr-outbuf, g ); // flush
    }
But when we take into account that buffer sizes here
are determined by the app's convenience (eg. can be 1 byte or whatever),
there are not many codecs compatible with that.
Anyway, the codec state has to be completely encapsulated, and
it should be possible to process multiple streams at once within
a single thread.
And while APIs with FILE* are nearly the worst possible
(it's unportable, as the FILE type is compiler-specific, and there are other issues),
RAM-only APIs are not much better either - the common
task is compressing streams, not files.
Also, callbacks for stats and I/O (and sometimes memory management)
are very annoying.
I think that taking a codec library and writing a sample file-to-file
coder for it should be straightforward, and I'm certain that it's not
the case when you have to define 10 callbacks to do that, or have
to drop the compression when progress reporting is required after each 64k
of data (like with some memory-only APIs).
I was considering something along the lines of zlib,
as it seems to be the most widely used compression library, so I believe many programmers should feel at ease with it.
> Anyway, the codec state has to be completely encapsulated and
> it should be possible to process multiple streams at once within
> a single thread.
I agree, and that's probably the main part I need to rewrite correctly.
Not something too difficult, I guess.
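For concreteness, here is a rough sketch of what such a zlib-style interface could look like. Every identifier below is hypothetical and purely illustrative, but it keeps all state in an opaque context (so several streams can be processed within one thread), uses no callbacks, and lets the caller pick its own buffer sizes, as discussed above.
Code:
    /* hypothetical etincelle.h - illustration only, not the actual API */
    #include <stddef.h>

    typedef struct EtcStream EtcStream;   /* opaque: the whole codec state lives inside */

    typedef enum { ETC_OK, ETC_NEED_INPUT, ETC_OUTPUT_FULL, ETC_STREAM_END, ETC_ERROR } EtcStatus;

    EtcStream* etc_create(size_t buffer_size);   /* one context per stream */
    void       etc_free(EtcStream* s);

    /* The caller hands over whatever input/output it has; the codec consumes and
       produces what it can, reports how much, and never calls back into the app. */
    EtcStatus etc_compress(EtcStream* s,
                           const void* in,  size_t in_size,  size_t* in_consumed,
                           void*       out, size_t out_size, size_t* out_produced,
                           int is_last_chunk);

    EtcStatus etc_decompress(EtcStream* s,
                             const void* in,  size_t in_size,  size_t* in_consumed,
                             void*       out, size_t out_size, size_t* out_produced);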
I'm currently having a very difficult time trying to circumvent a problem which I'm starting to believe is compiler-specific.
I'm investigating Intrinsic's claim that speed decreased when introducing "very long matches" in beta4. And yes, it's correct. I changed my test sample, and now I can safely measure the difference quite clearly.
This looked very strange to begin with, as the "very long matches" modification is really unobtrusive. I mean, it is a simple test and pointer modification which happens once in a while, something like once every 100K loops. So it should not be possible to see the difference.
Digging further, I replaced the culprit code with a useless one, and here is what I got.
The following line of code has no effect at all on speed. It does nothing useful, but happens so rarely that it's impossible to measure its impact:
Code:
    if (!len) ref = continuation;
Similarly, the following line has no impact either:
Code:
    if (!len) ref++;
The next line, however, decreases performance by 3-5%. It does nothing useful either, and does not execute more often than the previous ones:
Code:
    if (!len) continuation = ref;
Similarly, the following line decreases speed, although less sharply (~2%):
Code:
    if (!len) continuation = 1;
A few words on the variables used:
len: an integer that was calculated just before the test.
ref: a pointer which is used very often in the main loop.
continuation: a global pointer variable, which is never used anywhere, except when the test succeeds.
Even taking cache misses into consideration, there is no way a simple pointer assignment happening once every 100K loops can change performance in such a measurable way.
So I decided to "read the binary", using IDA Pro, as suggested by Shelwien some time ago.
And indeed, I noticed that the main loop is affected by the presence of the slowing second line (even though this line is not in the loop at all!).
The whole beginning of the loop looks different, more complex and heavier.
So I guess the compiler is trying to be too smart.
I was wondering how I could prevent such an effect.
I couldn't find a "disable optimisation" compiler directive that could be applied to a specific sub-part of the code (there is one, but only for the whole code).
I tried to change the line into a function call: same result. I guess the function is getting inlined by the optimizer.
I couldn't find a switch to ensure this specific function would not be inlined (there is one too, but for the whole module, not for a specific function).
So I'm short of ideas. Maybe writing this problematic line of code in assembler, ensuring it does not get "optimized" into the main loop?
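Before resorting to assembler, two things may be worth trying; these are per-function and per-branch hints that exist in GCC and MSVC, and whether they apply to the compiler used here, or actually cure this case, is untested.
Code:
    /* Compiler-specific hints; 'ref' and 'continuation' stand for the variables
       described above. Falls back to no-ops on other compilers. */
    #if defined(__GNUC__)
    #  define NOINLINE    __attribute__((noinline))
    #  define UNLIKELY(x) __builtin_expect(!!(x), 0)
    #elif defined(_MSC_VER)
    #  define NOINLINE    __declspec(noinline)
    #  define UNLIKELY(x) (x)
    #else
    #  define NOINLINE
    #  define UNLIKELY(x) (x)
    #endif

    /* Cold path, executed roughly once every 100K iterations: keeping it in a
       separate, non-inlinable function can stop its register and aliasing
       requirements from leaking into the hot loop's code generation. */
    static NOINLINE void save_continuation(const unsigned char* ref,
                                           const unsigned char** continuation)
    {
        *continuation = ref;
    }

    /* In the main loop:
       if (UNLIKELY(!len)) save_continuation(ref, &continuation);  */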
Rgds
Last edited by Cyan; 21st April 2010 at 22:26.
gcc: -fstrict-aliasing
Thanks Bulat.
I tried your suggestion, but unfortunately did not end up with anything useful.
In the end, I changed the code entirely, and now the test, instead of being done outside of the main loop, is done within it.
That means, basically, that both the test and the action (the pointer update) are now triggered much more often; and nonetheless it ends up being a gain in compression speed.
yes, weird....
Anyway, this leads us to Etincelle RC2.
It basically brings compression speed back to beta3 level and better, while improving the compression ratio. The new update mechanism seems to almost always improve compression, even if by a tiny margin.
There is also a new "asymmetric hashing" strategy, with noticeable gains on large files with large repetitions, but small losses on smaller files. I should also write something about this and the theory behind it.
http://phantasie.tonempire.net/pc-co...s-t102.htm#160
http://sd-1.archive-host.com/membres...ncelle-RC2.zip
[Edit] : RC1 -> RC2 comparison analysis by Sami Rumsas : http://compressionratings.com/i_etincelle.html
So now, the next step is to open-source.
I've been convinced to use the GPL, thanks to much advice.
Now, there is still the choice between v2 and v3.
Apparently, there is enough difference that it produced some friction between the FSF and Linux some time ago.
Have you got an opinion on this? Is v2 more stable or more flexible as a license than v3? Or is v3 more secure? Or more compatible?
Well, I'm kinda lost in all this...
Rgds
Last edited by Cyan; 23rd April 2010 at 15:08.
Many developers write something like "GPL v2 or higher, you may choose yourself". At least this allows combining your work with any other GPL sources, whether licensed under v2 or v3.
Your library ought to be licensed with the LGPL, and the command-line tools with the GPL.
I feel that so strongly that I'm telling you in an uncompromising way, not just advising you ;D
Interesting point of view Willvarfar.
Since then, I had a look at the LGPL statement on the GNU site,
and here is the gist of my understanding:
> Why you shouldn't use the Lesser GPL for your next library
> (...) we are now seeking more libraries to release under the ordinary GPL.
> (...) There are reasons that can make it better to use the Lesser GPL in certain cases. The most common case is when a free library's features are readily available for proprietary software through other alternative libraries. In that case, the library cannot give free software any particular advantage, so it is better to use the Lesser GPL for that library.
> This is why we used the Lesser GPL for the GNU C library. After all, there are plenty of other C libraries; using the GPL for ours would have driven proprietary software developers to use another: no problem for them, only for us.
So, if I understand correctly, since there are already many compression libraries out there, using LGPL (7zip) or even less restrictive licenses (zlib),
it seems correct to use LGPL for the Etincelle compression library.
Comments are welcome.
Generally speaking, I recommend the GPL if you want to earn money from it (of course via a secondary commercial license), the LGPL if you don't need money but want to keep your ideas protected (from being used in closed-source software), and BSD otherwise.