Can you think of a compressor with compression speed a thousand times higher than decompression speed?
Do you know of any publications mentioning or discussing such unusual cases?
paq on a multi-core CPU: its algorithm allows all models to be executed independently during compression, but not during decompression.
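A rough sketch of that point, with two invented toy models standing in for paq's real ones (the thread pool only illustrates the structure; with Python's GIL a real implementation would use native threads or processes):
Code:
from concurrent.futures import ThreadPoolExecutor

def model_a(bits):
    # toy model: probability of a 1 from a running global count
    n0 = n1 = 1
    out = []
    for b in bits:
        out.append(n1 / (n0 + n1))
        n1 += b
        n0 += 1 - b
    return out

def model_b(bits):
    # toy model: probability of a 1 given the previous bit
    cnt = {0: [1, 1], 1: [1, 1]}
    prev = 0
    out = []
    for b in bits:
        c = cnt[prev]
        out.append(c[1] / (c[0] + c[1]))
        c[b] += 1
        prev = b
    return out

def encode_predictions(bits):
    # the encoder knows every bit in advance, so each model can scan the
    # whole input on its own core, completely independently of the others
    with ThreadPoolExecutor() as pool:
        fa = pool.submit(model_a, bits)
        fb = pool.submit(model_b, bits)
    # only this part has to stay sequential; paq's adaptive mixing and the
    # arithmetic coder would go here, as one pass over the stored predictions
    return [(a + b) / 2 for a, b in zip(fa.result(), fb.result())]

The decoder can't split the work this way, because each model's context for bit i contains bits that haven't been decoded yet.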
Yeah, it should be possible to make paq 4x faster on a quadcore.
Comparing versions fully speed-optimized for 1 and 4 cores is another
matter though, as paq is not really optimized as it is.
But anyway, the other direction (decompression being much slower
than compression) is a bit harder, but possible too.
For example, we could implement a "normal" BWT compressor
paired with a reduced-memory decompressor (like in bwmonstr).
Also there are quite a few other cases where normally additional
complexity is added to the compressor, but it's possible to make
the compressor a bit faster and the decompressor much slower.
I mean stuff like an ABS coder, code stream (de)interleaving and
various transforms.
Such cases might also be common in recompression - for example,
instead of detecting the deflate settings on the compression side, as precomp does,
we could just store the inflated data and a hash of the original block, and
leave the brute-force search to the decoder.
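A minimal sketch of that idea with Python's zlib (the function names are made up, precomp itself stores more metadata, and the search below only succeeds if the original stream came from zlib with a reachable parameter combination):
Code:
import hashlib
import zlib

def pack_block(deflate_stream):
    # encoder side: one inflate plus a hash of the original raw deflate stream
    raw = zlib.decompress(deflate_stream, -15)      # raw deflate assumed
    tag = hashlib.sha256(deflate_stream).digest()
    return raw, tag                                 # raw data recompresses better later

def unpack_block(raw, tag):
    # decoder side: brute-force deflate parameters until the hash matches
    for level in range(10):
        for memlevel in range(1, 10):
            co = zlib.compressobj(level, zlib.DEFLATED, -15, memlevel)
            candidate = co.compress(raw) + co.flush()
            if hashlib.sha256(candidate).digest() == tag:
                return candidate                    # bit-exact stream recovered
    raise ValueError("no parameter combination reproduces the original stream")

Encoding costs one inflate and a hash, while decoding has to search the parameter space, so the decoder ends up much slower - exactly the kind of asymmetry asked about.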
Anyway, I can think of more examples if necessary, and they might
even make some sense (usually the encoding speed can be slightly
improved at the cost of significant decompression slow-down),
but it's not quite practical anyway :)
Do you mean that only one core will be busy during decompression? When you get predictions from the models, why can't you process half of the models on another core? After the bit is successfully decoded, why can't you update half of the models on one core, and the other half on another core?
Of course, there are some options for parallelism in decoding too, but very limited.
If the file was encoded sequentially, you basically can't decode the next bit
until you know the previous bits (speculative decoding is possible, but doesn't
really change anything).
Sure, it's still possible to run parts on different cores, but with a sync on each bit
it would likely end up being slower than the single-threaded version.
Well, unfortunately that's how it is.
The environment on a PC is completely unstable, so it's basically impossible
to force two threads to do different work in the same number of CPU clocks
and then exchange their info without waiting.
The easiest thing would be to compute the probability estimation in one
thread and do the update in another, so that update for previous bit
and estimation for the next can overlap.
But then we'd have to verify whether estimation uses any of update results,
and wait for these if it does.
And then, we'd have to ensure that the update finishes before the next one,
and transmit the decoded bit to the update thread.
Thus, both threads would waste quite a bit of time waiting (e.g. 10-20% easily),
and we'd need additional checks for context matches.
Code:
| ctx check/wait | wait for bit |
| P()            | --           |
| bit decoding   | Update()     |
Now, this still might improve speed with a complex model like paq8,
but with faster CMs (like ccm, m1) it just doesn't make sense.
Alternatively, we can run some of the submodels (both estimation and update)
in different threads. For paq8 this might actually be better, because
there are a lot of independent submodels:
Code:
| M1.P()        | M2.P()       |
| wait for M2.P | wait for bit |
| mixing        | --           |
| bit decoding  | --           |
| M1.Update     | M2.Update    |
But still, even with paq8, it's unlikely that it would be better than 1.5x
with 2 threads, and further scaling would be even worse.
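As a toy sketch of the first schedule (estimation and bit decoding in one thread, the update for the previous bit in another), with the actual arithmetic decoding stubbed out and a plain lock standing in for the context checks; all names here are invented:
Code:
import queue
import threading

class ToyModel:
    # a single adaptive bit model; the lock stands in for the
    # "wait for update results" checks described above
    def __init__(self):
        self.n0 = self.n1 = 1
        self.lock = threading.Lock()
    def estimate(self):
        with self.lock:
            return self.n1 / (self.n0 + self.n1)
    def update(self, bit):
        with self.lock:
            if bit:
                self.n1 += 1
            else:
                self.n0 += 1

def decode(bits):
    # "decodes" a bit stream; here the bits are simply read back,
    # since only the per-bit thread hand-off is being illustrated
    model, q, out = ToyModel(), queue.Queue(), []
    def updater():                       # update thread
        while True:
            b = q.get()
            if b is None:
                break
            model.update(b)              # overlaps with the next estimate()
    t = threading.Thread(target=updater)
    t.start()
    for b in bits:                       # decode thread
        p = model.estimate()             # would feed the arithmetic decoder
        out.append(b)                    # real code: b = coder.decode(p)
        q.put(b)                         # hand the decoded bit to the updater
    q.put(None)
    t.join()
    return out

Even in this stripped-down form, every bit forces a queue hand-off, which is where the waiting mentioned above comes from.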
Anyway, I'm not saying that CM decoding speed can't be improved
by multithreading, but compared to encoding the possibilities are
very limited.
The encoder has access to all the data at once, so it's much easier
to run in parallel with barely any sync at all (blockwise).
There are platforms that do better in this regard.
The Tilera CPU and its Zero Overhead Linux can guarantee no context switches on particular cores.
Also, another scaling idea, not related to Tilera:
First, for each bit of the source, count context statistics over some fixed-length window before it. Having the statistics, run all the models on the bit. Then comes mixing. If you use static weights, the whole algorithm is inherently parallel (though the statistics counting repeats the same work in different threads...). If you do something smarter, you might end up with a part that's sequential, but if it's lightweight enough, it might still allow huge scaling.
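A toy sketch of that scheme (the window size, the two models and the weights below are arbitrary placeholders): each prediction depends only on the input bits in its window, so the positions can be computed in any order, on any number of cores.
Code:
WINDOW = 1024        # fixed history length (arbitrary)
W1, W2 = 0.7, 0.3    # static mixing weights (arbitrary)

def predict(bits, i):
    history = bits[max(0, i - WINDOW):i]
    # toy model 1: ones-frequency over the whole window
    p1 = (sum(history) + 1) / (len(history) + 2)
    # toy model 2: ones-frequency over the last 32 bits only
    tail = history[-32:]
    p2 = (sum(tail) + 1) / (len(tail) + 2)
    return W1 * p1 + W2 * p2

def predictions(bits):
    # no prediction depends on another prediction, only on the raw input,
    # so this loop could be split across threads, processes or machines
    return [predict(bits, i) for i in range(len(bits))]

The repeated counting inside each window is the duplicated work mentioned above.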
> There are platforms that do better in this regard.
Even just having a memory cache is enough to demonstrate
100x differences in timings of the same code.
Also, with more threads there are more sync points... and each
potentially wastes time. So when the time required to complete
the task allocated to a thread becomes smaller than the time
wasted on sync, the whole process likely becomes slower than
the single-threaded version too.
Regardless of how many cores the CPU has, if memory is split into fast memory and slow memory (e.g. RAM and HDD), and a random access to fast memory typically costs F cycles while a random access to slow memory costs S cycles, then context-mixing compression on this system can be up to (S+c)/(F+c) times faster than CM decompression (with c standing for the computation cost per access).
Because you can design the algorithm so that during compression almost all random accesses to memory fall into fast memory, and during decompression almost all of them fall into slow memory: all models can work in turn in the compressor (therefore each model can use almost all of the fast memory), but in the decompressor you must run through all models for each bit of data being decompressed.
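For a sense of scale, a tiny sketch with assumed (not measured) numbers, taking F as a cached RAM access, S as an HDD seek and c as the per-access model work:
Code:
# all values are illustrative cycle counts, not measurements
F, S, c = 100, 10_000_000, 100
print((S + c) / (F + c))   # ~50000x: the possible compression/decompression gap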
What are the advantages of such a design? For example, if the compressor knows how much fast memory the decompressing side can allocate, it can initialize and run the appropriate number of models; e.g. 2 GB of RAM in the decompressor => a compressor having only 128 MB of RAM can run up to 15 models consecutively, each of these models using ~128 MB of RAM (although decompression on a device with 128 MB of RAM will be either (S+c)/(F+c) times slower or impossible).
srep -m1 may be an interesting example, if you need an algorithm that compresses in RAM and decompresses using the HDD.
An algorithm which works with integer prime-factor multiplication, for example: the inverse, integer factorization, is dead slow. There exist satellite-link image compressors which work that way; I don't have the reference at the moment. I think a lot of image compressors which have to compress on very simple machines that are not exchangeable (no taking out the cartridge, no possibility of a cable, and so on) compress faster than they decompress. The destination has all the resources necessary.
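A toy illustration of that asymmetry (arbitrary small primes and plain trial division, nothing like a real system):
Code:
def factor(n):
    # recover the prime factors by trial division: the slow, "decode" direction
    factors, d = [], 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

p, q = 104723, 104729   # two primes; the "encode" direction is one multiplication
n = p * q
print(factor(n))        # the "decode" direction needs ~10^5 trial divisions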
Another "common" case would be if you have a complex implicit set of rules and create a source string for which it's too complex to maintain the related statistics. The sender may instantly produce the next symbol, while the receiver has to apply and exclude all rules which are violated in order to reach the unique valid set of possible symbols. A trellis code is an example of this kind of implicitness.
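As a sketch of the trellis-code case (a generic rate-1/2, constraint-length-3 convolutional encoder with the common textbook generators 7 and 5, not something taken from this thread): producing each output pair is just two parity computations, while the receiver has to search the trellis, e.g. with a Viterbi decoder, which is far more work per bit.
Code:
def conv_encode(bits, g1=0b111, g2=0b101):
    # rate-1/2 convolutional encoder: two parity bits per input bit
    state, out = 0, []
    for b in bits:
        state = ((state << 1) | b) & 0b111          # 3-bit shift register
        out.append(bin(state & g1).count("1") & 1)  # parity under generator 1
        out.append(bin(state & g2).count("1") & 1)  # parity under generator 2
    return out

print(conv_encode([1, 0, 1, 1, 0]))   # undoing this (with noise) needs a trellis search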
Another trivial case: when the compressor has a huge amount of memory and the decompressor almost none, you may get 1:1000 speed ratios.