# Thread: Applications for possibility of removing statistical modelling from compressor?

1. ## Applications for possibility of removing statistical modelling from compressor?

Some time ago I was writing about a problem for reversed entropy coding (thread): encoding any bit sequence as a sequence of symbols having some chosen statistics, such that only the sender knows the probability distributions - for example to encode as a sequence resembling some common data for various steganographic/watermarking purposes, like QR codes resembling a picture (grayness becomes probability of using 1 in the corresponding pixel of the code - ICIP2014 paper).
If decoder knows the probability distributions, it can be easily done by using entropy coder with switched encoder and decoder.
If decoder doesn't know, it is more difficult, but the same rates can be approached (generalization of the Kuznetsov-Tsybakov problem).

The interesting question for data compression is if we can perform standard entropy coding with probability distributions unknown to the encoder? (known only to decoder)
In other words, can we remove the statistical modelling from the compressor? (decoder still has to do it)
Thanks of recently working on JRC, I have just realized that it can be done!

So JRC (Joint Reconstruction Codes, thread) allows to perform error correction, such that in contrast to standard error correction, the encoder doesn't need to know damage levels (to choose redundancy levels). For example for broadcasting or storage media problem - while the encoder has to use universal packets, each packet (receiver) can have a different noise level.
In contrast to standard error correction, JRC decoder uses damage levels as individual trust levels for each packet - it tries to guess the original message to get agreement with all the received packets, tolerating some disagreement depending on the noise level of each packet.
It assumes that message bits were from uniform {0,1} probability distribution - we can put the statistical model there instead! (to restrict the space of interesting messages and so the required information)

So how to get entropy coding without the encoder's knowledge of the probabilities...
Assume the sequence is from a statistical model (iid, Markov, etc.) with entropy H bits/symbol - a N symbol sequence can be transformed into asymptotically length NH bit sequence.
Directly storing these symbol would take lg(m) bits/symbol instead, where m is the alphabet size, lg is base 2 logarithm.
In other words, while the number of all such sequences grows like m^N = 2^(N lg(m)), the statistical model restricts this growth to asymptotically 2^(NH) - we can apply this restriction on the decoder side.

So finally, the encoder should produce more than NH bits to allow the decoder to choose one of 2^(NH) messages in agreement with the statistical model.
These >NH bits can be exactly like for JRC: uniformly distributed checksum bits which depends on the entire history.
Now the decoder needs to search the space of possible 2^(NH) messages - it can be done by sequential decoding as in the current implementation of JRC: by just adding the statistical model to probabilities (weights) of candidates to consider.
I can modify JRC implementation if there would be some real applications?

One application could be making cheaper encoders e.g. for sensory networks (however, decoding would become much more expensive).
Another can be distributied source coding type of applications - imagine spreading sensors locally. Individually they have poor statistics to adapt to varying conditions, hence it might be beneficial to shift the need of knowing the statistical model to the decoder, which have information from all of them.
What real-life applications of such compression without statistical modeling could you think of?

2. ## Thanks (2):

Cyan (3rd September 2015),Shelwien (6th September 2016)

3. I think that this system could be quite useful for backing up data that's constantly changing/growing, like in a server environment. If this system encodes fast enough it can be used for external signal capture, maybe even used for video compression in slow motion cameras or other high bandwidth equipment.

4. This possibility is very expensive from decoder side, and really approaching the original compression ratio would make this cost huge - realistic compression ratio would be smaller (I could find Pareto coefficients).
While encoder produces some pieces of information, the decoder needs to search through the space of possible messages to find the one in agreement with all these pieces of information - focusing only on messages in agreement with some model, like probability distribution for entropy coding.
Observe that we do something like that for natural language - even if a sender would produce a random sequence of sounds in a noisy environment, the decoder's brain would try to decode it as some meaningful message: in agreement with the model of language - both deterministic (grammar) and statistical (what is he more likely to say).
Thanks of it, the space of possibilities is greatly reduced, and so much less pieces of information are required to determine the intended message.

However, for computer situation, the cost and imperfection makes it really difficult to find real applications ...
Especially that basic statistical modelling is really cheap (your server environment), and in many situations it is sufficient to use a static model, like for differences in lossless image compression or DCT coefficients in lossy...
If there are indeed realistic practical applications (?), it should be rather a situation where encoder doesn't have a good information about the model for the signal, like for spread sensors.

But probably this is only an interesting theoretical possibility (?)
To summarize, we can perform entropy coding without the knowledge of one side:
reversed: bits -> symbols without the knowledge of decoder (Kuznetsov-Tsybakov, steganography)
standard: symbols -> bits without the knowledge of encoder (searching only through messages in agreement with the model)
the tolerated lack of knowledge is on the "symbol" side in both cases - can we do anything when the lack of knowledge is on the "bits" side?

5. so this system of yours can only make an assumption on the compressed data? wouldn't this be a kind of lossy compressor?

6. Indeed the data needs to be compressible to make it reasonable.

Generally, data compression is based on assumption that given data comes from some model (deterministic: allowed sequences, statistical: typical sequences(AEP) ) - what allows to restrict the space of meaningful messages, and so use a shorter descriptor to point one of them (the compressed file).

This restriction to "typical" messages is used by encoder in standard data compression - e.g. entropy coder can be seen as a way to enumerate combinations: the number of length n 01 sequences having Pr('1')=k/n of '1' is
binomial(n,k) ~ 2^(n*h(k/n))
where h(p) = -p lg(p) - (1-p) lg(1-p) is Shannon entropy.

What I'm saying here is that the encoder doesn't need to perform this restriction.
Instead, it can just write some amount of pieces of information about the message, which would be sufficient to determine a unique meaningful message (in agreement with the model).
Only decoder uses the statistical model here - which determines the subspace of possible messages to search from.
The encoder doesn't need to know the exact model here, however, it needs to know the approximate ratio: what amount of pieces of information should he use to uniquely determine a single meaningful message.

Regarding lossy compression, it is usually done by a quantizer in real applications, like chopping less significant bits of DCT coefficients - then you use standard entropy coding, so one could use the discussed possibility.
The general rate distortion theory allows for a more sophisticated lossy compression - the encoder looks for a shorter descriptor, which would be decoded into something close to the original message - I currently don't see any reasonable way to combine them.

7. What parts specifically are you trying to combine?

8. I assume you are referring to the rate distortion ...
What I could currently do is all or nothing: complete decoding or (nearly) none - depending if the encoder has provided sufficient rate: number of pieces of information ("hints").
It would be great to include graceful degradation lossy compression there - make that:
- if the rate is sufficient, you get the exact message,
- if it turns out insufficient, you get a deformed message - the larger rate, the closer it is to the original message.
However it seems hard - it should be doable, but I currently don't know how to realize it.
Like when you mishear what somebody has told, the meaning can be very far.

9. ah, now i get what you're trying to do.
maybe a two pass system could work, where the minimum rate is calculated then used throughout the whole message. But I'd assume that'd be equivalent to Shannon's limit.

10. This would be an extra feature here and the general problem is that it is very difficult to design robust codings: maintaining metric - that close inputs lead to close outputs.

One could get graceful degradation with so called "unequal error protection" - introducing a hierarchy of importance among parts of the message and correspondingly gradate their protection levels - however it is far from the optimum.

11. You happen to have reminded me of an old idea I once had, about 2 years ago I was curious on whether an MDCT transform could be used for lossless compression which I guess ties in with what you're trying to accomplish here. I'm genuinely curious to see how this will turn out.

12. Fourier transform like DCT is unitary - maintains distances, so quantization itself behaves well (used in most of lossy image and sound compression).
However, you finally need to encode the coefficients - getting to the separate: information theory problems, this thread was intended to focus on.
Unless having potential of some real applications, I currently don't plan going forward in this direction.

13. A typical problem of the modeling phase is that it creates data dependencies, and hence dependencies in the data flow that limit parallelization of the encoder (and also of the decoder). Thus, if you can skip modelling at the encoder, you move the problem of resolving data-dependencies to the decoder. This might be interesting where you need to have a very high-speed encoder, but can deal with a lower-speed encoder. A typical use-case for such a szenario would be the accquisition of high-speed video (high-speed cameras) that are connected over a limited-bandwidth cable to a main unit. Accquisition, transport and storage has to be very fast, but replay can usually happen at a PC with considerably higher computing power available.

BTW, this sounds like an excellent topic for next year's DCC (Data compression conference), including your PCS work.

14. ## Thanks (2):

Bulat Ziganshin (3rd September 2015),Jarek (3rd September 2015)

15. The problem is that it is very costly for decoder and practically reachable ratios are lower - the income from using inter-slice information would be probably much smaller than loss.
And video compression research requires a large group, while I'm practically alone with coding theory here.

16. What about audio compression over wireless connection? This system could increase signal content for Bluetooth applications (which are generally much slower than wifi connections) assuming a certain compression rate can be achieved.

17. Lucas, I don't understand. Static entropy coder should be satisfactory for audio compression.

I have some hints for real applications - the original new possibilities of JRC:

above scenarios are unavailable for standard error correction, which needs the encoder to choose redundancy levels, which should be the same for all receivers ... while various receiver-dependent redundancy levels are required in the above scenarios.
They are available for JRC thanks to using "encoder independent error correction" - where the decoder adapts to the actual noise levels.
Encoder just writes some universal hints about the content of the message, each decoder individually wants to make the best of the received hints.

We can add "encoder independent data compression" from this thread to above scenarios: make that decoder tries to reconstruct considering only messages within some model - restricting the space of possibilities and so the required information. In other words, while e.g. broadcaster sends universal JRC packets, each receiver can treat them not only as protected for the required individual damage level, but also as compressed in a way optimized for the prior information of each individual receiver - both using the learned model, but also complementing the missing information.

Example of practical application: distributing a software update (without unwanted P2P like WX :/ ) ...
So various users have different versions/platforms which some of files are overlapping (prior information) - the distributor could broadcast universal packets, such that each receiver would require only amount of packets corresponding to the actually missing information.
For the data storage example, it could e.g. allow to reconstruct from multiple damaged copies, including some not encoded with JRC.
Any more applications?

18. Ok, lets write some details – changes from the current JRC to include a model by decoder.
Comparing to v2 of http://arxiv.org/pdf/1505.07056.pdf :

Weight function (4) evaluating candidates x for the next length N bit block of the message:
W(E) = M - N + lg(Pr(bits of received files assuming E))
Becomes
W(E,x) = M + lg(Pr(x)) + lg(Pr(bits of received files assuming x))
Where Pr(x) is from our model (were 2^-N without it).

The analysis of statistical behavior:
( 7 ) has 2^N assuming uniform probability distribution of x – there should be "sum_x" instead
( 8 ) should have additional "sum_x Pr(x)" to average over x
and both should have W(E) replaced with W(E,x)
Finally, the N in (9) and (10) becomes replaced by u-th Renyi entropy of the model (which is N for uniform probability distribution).

Sequential decoding is approximately described by Pareto coefficient c, such that u = 1/(1+c). For c = 0 we have Shannon, for c = 1 we have so called cutoff rate.
Bidirectional decoding can easily work with c = 1/2, here are ratios (required bits/symbol) compared with the optimal: Shannon entropy for binary probability distribution.

Maybe LDPC-based decoder could do a bit better (?)

#### Posting Permissions

• You may not post new threads
• You may not post replies
• You may not post attachments
• You may not edit your posts
•