Does anybody know what an LZ method would be called
which uses incomplete matches (difference compensation)?
For example, 123456 and 123356 would be a match, plus some compensation data: 000100.
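(For illustration only - a minimal sketch of how such "compensation data" could be computed for the digit example above; the mod-10 per-digit delta format is just an assumption.)

#include <stdio.h>
#include <string.h>

/* Hypothetical sketch: compare a new block against an "incomplete match"
   from the dictionary and emit per-digit compensation data (mod 10),
   so 123456 vs 123356 gives 000100. */
void compensation(const char *match, const char *block, char *comp) {
    size_t n = strlen(block);
    for (size_t i = 0; i < n; i++)
        comp[i] = (char)('0' + (block[i] - match[i] + 10) % 10);
    comp[n] = '\0';
}

int main(void) {
    char comp[16];
    compensation("123356", "123456", comp);
    printf("%s\n", comp);  /* prints 000100 */
    return 0;
}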
Only *diff utilities do that.
There are no LZ compressors like that, because it would be awfully
slow to search for such patterns, and then compression might even
decrease after that, because this support for "patches" provides
an alternative way to generate the same sequence with different
codes.
Such a thing can be implemented with a CM+LZ combination, though.
First you process the data with CM (e.g. paq) and replace the
symbols with too small a probability (and store the patches somewhere).
And then any LZ would have somewhat improved compression, if
there were really a lot of mistypes or something like that.
Last edited by Shelwien; 26th May 2008 at 16:11.
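(A rough sketch of that CM-prefilter idea, under my own assumptions - the "model" below is just an order-1 most-frequent-successor table standing in for a real CM, and the threshold is arbitrary.)

#include <stddef.h>

/* Replace "surprising" bytes with the model's prediction and record patches,
   so a following LZ pass sees a smoother stream; the decoder runs LZ first
   and then applies the patches to restore the original bytes. */
typedef struct { size_t pos; unsigned char orig; } Patch;

size_t smooth_for_lz(unsigned char *buf, size_t n, Patch *patches) {
    static unsigned freq[256][256];          /* order-1 successor counts */
    static unsigned char best[256];          /* most frequent successor  */
    size_t np = 0;
    unsigned char prev = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned char c = buf[i];
        unsigned char p = best[prev];        /* model's prediction */
        if (freq[prev][c] * 4 < freq[prev][p]) {   /* "too small probability" */
            patches[np].pos = i;             /* remember how to undo it */
            patches[np].orig = c;
            np++;
            buf[i] = p;                      /* feed the LZ the prediction */
        }
        freq[prev][c]++;
        if (freq[prev][c] > freq[prev][best[prev]]) best[prev] = c;
        prev = c;
    }
    return np;  /* the patches would be stored alongside the LZ stream */
}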
Yes, it would be slow, but I think comparing
LZD vs DELTA+LZ is like comparing PPM vs BWT,
because the current implementations have many limitations,
and also IMHO it would be superior for multimedia data like WAV, BMP.
Ah, I misunderstood.
I thought you were talking about sparse matches or something like that.
But anyway, LZ is only useful for types of data with high probability of
long matches. And numeric data which can be better compressed with
delta-coding doesn't commonly contain any matches.
I mean, of course a delta model would be useful.
But an algorithm able to use that kind of correlations just won't be called LZ.
Last edited by Shelwien; 26th May 2008 at 16:23.
I didn't mean the data is numeric, those numbers were just an example.
Can't see any reason why it's not LZ.
Compensation should be somehow limited, maybe -8..+7, 16 values,
then it would take half of the data space.
PS: I have a feeling that when Bulat reads this he will add it to FA,
which would be great.
Last edited by chornobyl; 26th May 2008 at 17:04.
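(A minimal packing sketch for that -8..+7 limit - two compensation values per byte; the layout is just an assumption.)

#include <stdint.h>
#include <stddef.h>

/* Pack signed compensation values limited to -8..+7 into nibbles,
   so the compensation stream takes half the space of raw byte deltas. */
void pack_deltas(const int8_t *delta, size_t n, uint8_t *out) {
    for (size_t i = 0; i < n; i += 2) {
        uint8_t hi = (uint8_t)(delta[i] & 0x0F);
        uint8_t lo = (i + 1 < n) ? (uint8_t)(delta[i + 1] & 0x0F) : 0;
        out[i / 2] = (uint8_t)((hi << 4) | lo);
    }
}

void unpack_deltas(const uint8_t *in, size_t n, int8_t *delta) {
    for (size_t i = 0; i < n; i++) {
        unsigned nib = (i & 1) ? (in[i / 2] & 0x0F) : (in[i / 2] >> 4);
        delta[i] = (int8_t)(nib >= 8 ? (int)nib - 16 : (int)nib);  /* sign-extend the nibble */
    }
}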
It's better to compress whole numbers, not their digits or something.
Also I've never seen an example of data with "hidden matches".
Usually if there are matches in a delta sequence, it means that the original
data at the locations with matching deltas is similar too.
And there's no reason to compress data without matches using LZ.
> I didn't mean the data is numeric, those numbers were just an example
And if the data isn't numeric, then we return to the topic of
sparse matches. Of course it's possible to create an LZ generalization
with support for any patterns, not only substrings, and it would
certainly be LZ, and I guess it could have better compression than
traditional LZ, but the matter of compression time kills the whole idea.
Btw, there's an example of that in the form of motion prediction with
exhaustive search in video compression - check its speed to see
how a sparse LZ could work.
Is there any way to do "blind" Huffman decompression,
when you haven't got the stored tree and you rebuild it?
The reason why I am asking is simple - preprocessing of ANY Huffman-compressed
data, not just JPEG or ZLIB streams.
Exact, precise data is not needed, only to restore the correct probabilities,
i.e. remove Huffman's entropy.
> Is there any way to do "blind" Huffman decompression,
> when you haven't got the stored tree and you rebuild it?
Of course, if there's enough redundancy in the code.
Though that's already cryptography I think.
Ah, and the code had better be static, or else.
And considering that all the common Huffman-based
formats use blockwise encoding, you'd have to be
able to split the stream into blocks at least.
> The reason why I am asking is simple - preprocessing of ANY Huffman-compressed
> data, not just JPEG or ZLIB streams.
> Exact, precise data is not needed, only to restore the correct probabilities,
> i.e. remove Huffman's entropy.
It's a good application of LZ optimal parsing (preferably one which uses a
precise code-length metric, not some random weights for matches in the sequence).
I mean, the only "catch" in a Huffman code is its prefix property, so it
should be possible to recover it, given plenty of (redundant) data, by converting
bits to bytes and applying the optimal parsing.
I mean, if there's enough redundancy in the data, then most of the matches would
start at the start of some Huffman code.
But there probably won't be enough redundancy in the sequence of LZ matches
for a 32k block of data (and after that the Huffman tree is switched).
So I guess the only option is to drop the idea of recovering the tree
and just concentrate on removing its redundancy.
For that you'd need some good bitwise CM. An easy hack might be to take paq8/lpaq/paq9
and remove the bit context zeroing on byte boundaries, and add a context shift
and masking on each bit instead.
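(Roughly what that hack could look like - simplified, not actual paq8/lpaq code; CTX_BITS is an arbitrary assumption.)

#include <stdint.h>

/* Simplified bitwise context update. The usual bytewise CM resets the bit
   context at every byte boundary; here the context just keeps shifting and
   masking across boundaries, so Huffman codes that straddle bytes stay
   within a single context. */
enum { CTX_BITS = 16 };   /* assumed context width */

uint32_t update_ctx(uint32_t ctx, int bit) {
    return ((ctx << 1) | (uint32_t)bit) & ((1u << CTX_BITS) - 1);
}

/* versus the usual scheme, roughly:
     ctx = (ctx << 1) | bit;
     if (++bitcount == 8) { ctx = 1; bitcount = 0; }
*/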
Another optimization would be adding some optimal parsing counterpart to CM.
It may look like simulating insertion of a special separator symbol at each
position and comparing the compression with and without it.
(A good example of this is the worse CM compression of texts with spaces removed,
despite the smaller file size.)
Nowadays CM algos use mixing.
What about this sort of mixing:

#include <math.h>
/* iterated mixing of two values a and b: both converge to the
   arithmetic-geometric mean AGM(a,b) */
double agm_mix(double a, double b, int n) {
  for (int i = 0; i < n; i++) {
    double a1 = sqrt(a * b);    /* geometric mean */
    double b1 = 0.5 * (a + b);  /* arithmetic mean */
    a = a1; b = b1;
  }
  return a;
}
I tried to visualize common LZ methods as I understand them; correct me if I'm wrong.
(attached image, 2008-02-05)
I don't really understand what you visualized there?!
Something like dictionary fill, which depends on the file position.
In my understanding the LZ77 dictionary is the sliding window, sized N bytes. So the dictionary will fill up after processing 0...N-1 bytes and will remain full after that.
LZ78-based compressors should behave similarly: the dictionary fills up until the available memory is exhausted. Afterwards the dictionary can be kept (1), or rebuilt or discarded (2). It will produce either a sawtooth graph (2) or a limited ramp function (like LZ77).
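(A tiny sketch of those fill curves, under the simplification that the LZ78 dictionary grows by roughly one phrase per step.)

#include <stddef.h>

/* Dictionary fill level as a function of input position, per the description above. */
size_t lz77_fill(size_t pos, size_t window)           /* ramp, then stays full  */
{ return pos < window ? pos : window; }

size_t lz78_fill_keep(size_t pos, size_t capacity)    /* variant (1): kept full */
{ return pos < capacity ? pos : capacity; }

size_t lz78_fill_discard(size_t pos, size_t capacity) /* variant (2): sawtooth  */
{ return pos % capacity; }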
Can you draw it?
Funny pictures, but really there are only 2 variants - a sliding dict that is filled and kept full, and a block-split dict that is filled, kept full until the end of the block and then cleared before the next block.
Here we go...
@Bulat:
Blockwise processing is implementation-specific, and partially rebuilding the dictionary is, too (third graph). I'd just show the most "general" cases for streaming data processing.
2 Bulat:
what about LZW?
How will it look?
LZW is an LZ78 derivative, so it'll look like that. Some LZ78 variants might grow the dictionary nonlinearly, since they add multiple phrases per byte.
Most LZ* algos grow the dict non-linearly - at first growth is very fast, then it becomes slower, for the obvious reason.
Yes, there are algos which clear the dict only partially, but this doesn't depend on the algorithm - you can do it with LZ78, PPM and probably any other algo in the universe.
Usually CM algos predict the current symbol from previous ones.
What about predicting from the NEXT symbol, at least one?
Just like video codecs produce B-frames, and predict them from the next/previous/both frames.
> Usually CM algos predict the current symbol from previous ones.
Well, because it's the most useful.
> What about predicting from the NEXT symbol, at least one?
There are different ways to do that.
1. It's possible to calculate the probability distribution
for the next two symbols and fix it up using backward context
statistics. Nobody does that, for the obvious reason (speed).
2. It's possible to just exclude the last symbol from the
context - this is called a sparse context and is commonly
used in all strong compressors (a minimal sketch follows below).
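(For illustration, hypothetical context helpers - a plain order-2 context next to a sparse one that skips the most recent symbol; not from any specific compressor.)

#include <stdint.h>

/* Normal order-2 context (the two most recent symbols) vs a sparse context
   that excludes the most recent symbol and uses the two before it. */
uint32_t ctx_order2(const uint8_t *p)   /* p points just past the current position */
{ return ((uint32_t)p[-1] << 8) | p[-2]; }

uint32_t ctx_sparse2(const uint8_t *p)  /* the last symbol p[-1] is excluded */
{ return 0x10000u | ((uint32_t)p[-2] << 8) | p[-3]; }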
> Just like video codecs produce B-frames, and predict them
> from the next/previous/both frames.
In the video case it's basically reordering, and there's some
sense in doing that because a frame is relatively large.
If we applied this approach to general data, that would be
block reordering, not something at the symbol level.
And that's called segmentation and is used in some compressors.
Also symbol order reversing is used sometimes, mostly in
BWT compressors, as "right" contexts seem to provide better
statistics clustering for texts.
And special preprocessors or submodels, like E8, might reorder
the bytes in some specific areas where it's known to be useful.
Then, there's an "extraction" approach - it's possible to implement
a multi-pass CM compressor, using both "left" and "right" contexts -
just encode the locations of some symbols or words and remove them
(or replace them with something else).
And straight reordering at the symbol level is not at all reasonable, as
it requires encoding a lot of redundant information.
I faced a situation where I needed to compress two alike but not identical files.
These were 399MB wavpack files, one of which was damaged during transmission.
The straight approach is to compress them together with a 400MB dictionary and forget about it.
But that is almost fantasy with my 512MB of RAM.
So I took 7zip, split the files into 45MB pieces and compressed them with a 48MB dictionary; sorted by extension, files like A.001 and B.001 appeared together, and the final size was 410MB, which is what I expected.
This is the first time ROLZ compression was done by human+LZ77.
Last edited by chornobyl; 2nd November 2008 at 00:59.
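(A sketch of that manual trick done automatically - hypothetical helper, not an existing tool; the 45MB chunk size comes from the post and should stay below the compressor's dictionary size. Splitting the result back needs the original file lengths, which are omitted here.)

#include <stdio.h>
#include <stdlib.h>

/* Interleave two similar files in fixed-size chunks, so matching pieces of A
   and B end up within one dictionary-sized distance of each other. */
#define CHUNK (45u << 20)   /* 45 MB, assumed to be smaller than the dictionary */

int interleave(const char *fa, const char *fb, const char *fout) {
    FILE *a = fopen(fa, "rb"), *b = fopen(fb, "rb"), *o = fopen(fout, "wb");
    unsigned char *buf = malloc(CHUNK);
    if (!a || !b || !o || !buf) return 1;
    size_t na = 1, nb = 1;
    while (na || nb) {
        if (na) { na = fread(buf, 1, CHUNK, a); fwrite(buf, 1, na, o); }
        if (nb) { nb = fread(buf, 1, CHUNK, b); fwrite(buf, 1, nb, o); }
    }
    free(buf); fclose(a); fclose(b); fclose(o);
    return 0;
}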
Once I made some utilities for such cases.
http://shelwien.googlepages.com/patch_v1.rar
Well, the idea was to be able to
1) patch a broken file on a remote server with a valid local copy
2) patch a broken local file with a valid copy on a remote server
3) create a patch file for distribution when multiple people have
the broken file.
And archivers are certainly not very convenient for that.
Kinda offtopic though.
xdelta?
Most known implementations improve LZ methods by adding some CM stages where LZ does not perform well or CM is much better. So why not add some LZ model, and if it finds that on a certain data block LZ gets better results, then mark this block as "LZcompressed". The main problems are:
- both algos work simultaneously, so this may take twice or more the time and memory to compress
- the problem of smart blocking.
Of course we assume that an "LZcompressed" block is not discarded from the CM stats; it is processed the same way on de/compression. This also suggests a slightly different approach (see the sketch below):
1 compress the data with CM in the ordinary way.
2 try to improve the results by applying LZ to the same data
3 combine blocks from both results to get better compression
PS: all of the above may turn out to be obsolete, if it turns out that this stuff already exists and is called a "match model".
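(A rough sketch of the blockwise selection - the two cost estimators are placeholders for real CM/LZ coders, and each block's cost is estimated independently here, i.e. the interaction with shared CM stats is ignored.)

#include <stddef.h>

/* Hypothetical per-block coder selection: estimate the coded size of each
   block with both backends and keep whichever is smaller, storing one flag
   per block ("LZcompressed" or not). Flag overhead is ignored here. */
typedef size_t (*cost_fn)(const unsigned char *blk, size_t len);

size_t choose_blocks(const unsigned char *data, size_t n, size_t blk,
                     cost_fn cm_cost, cost_fn lz_cost, unsigned char *flags) {
    size_t total = 0, bi = 0;
    for (size_t pos = 0; pos < n; pos += blk, bi++) {
        size_t len = (n - pos < blk) ? n - pos : blk;
        size_t c_cm = cm_cost(data + pos, len);
        size_t c_lz = lz_cost(data + pos, len);
        flags[bi] = (unsigned char)(c_lz < c_cm);   /* 1 = mark block "LZcompressed" */
        total += c_lz < c_cm ? c_lz : c_cm;
    }
    return total;   /* estimated combined size */
}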
There is no variant of LZ, even a super-designed one, that may beat a properly designed CM. The only difference is speed. The only advantage of hybrid schemes such as LZ+CM is that we may use LZ for very long matches and CM in other cases - getting some speed improvement.
Of course the usual LZ implementations are not any
good from a modelling point of view - as they allow for multiple
encodings of the same data, and thus are considerably redundant.
But it's possible to make an LZ-like mixing scheme, with the codelengths
of all available matches estimated for a given position and the corresponding
rc intervals integrated into a symbol probability distribution.
Now, that would be superior to common CM methods, as they only consider
the symbol probabilities at a single point, while the LZ-like approach
would be something like separately mixing all the substring instances.
But the complexity of that is obviously too high, and if we'd try
to approximate it somehow, it'd probably become a common CM eventually.
In short, LZ is a considerably different modelling approach, and thus
it can better handle some kinds of data, sometimes even despite the
redundancy of LZ compressors.
And as to blockwise switching of CM/LZ, then yeah, it's useful,
but rarely implemented - I'm only sure about rar and nanozip.
That's probably because usually archiver developers and compressor
developers are different people.
While match models and the like are quite common these days, it's
still mostly a speed optimization and doesn't have the same
kind of probability estimation as a usual LZ.
Actually there are LZWs that can do better than a properly designed CM,
at least for some files.
The reason mine often compress some files better, at least till
the internal tables fill up, is because they are bijective from the
space of all files to the space of all files.
You can check them out at
http://bijective.dogma.net/compreslzw1.htm
Actually people seem to forget that there is no best compressor
for all files. Even the unity transform beats the best super
compressor for some files, since compressing some files means
expanding others.
The only reason CM does so well is that as it compresses
it tries to figure out which model is the best. But if one is
only interested in certain types of files and couldn't care less
about another set of files, CM is not the best.
Yes, you are right. That's the main reason why WinRAR is still a competitor to 7-Zip: WinRAR tries to classify compressed data while 7-Zip just blindly compresses it.
CM is a very generic term from my point of view. Someone can optimize an "optimizable CM coder" for a specific file (look at toffer's M1). To me this approach is superior when I compare it to well-optimized LZ-style compressors. Also, some more specific things can be made: a floating-point model etc. (look at Shelwien's GEO packer, which outperforms even PAQ).