# Thread: optimum minimum-cost representation

1. ## optimum minimum-cost representation

Hi:

What is the optimum minimum-cost representation of a series of numbers with the following general proportions:

191 103 53 20 16 3 4 1 1 1

OR can be

229 96 48 22 7 9 2 3 1

OR can be

190 96 42 29 16 6 4 2 1 0 1

OR can be

221 93 43 30 10 10 2 3

OR can be

192 96 50 28 15 3 5 2 0 1

You may also need to represent many such series all together (not just a single one).

2. Is there a fixed upper limit for the first number? Is it 255? Or is it unbounded?

What is the probability distribution of the first number?

What is the probability of an out-of-order number? (Such as "4" in the first sequence.)
Is an out-of-order number usually small?
Can you provide a(n approximating) formula of the conditional probability of an out-of-order number?

Can you approximate the proportion of the next value given a value (191->103; 103->53)? For the larger numbers it seems to be around 0.5. Is it? Does the proportion of the next value depend only on the current value? Does the presence of the out-of-order values come from these proportions? If so, the previous question may be merged into this one: modeling the proportion of the next value could be enough, so out-of-order values may not need a different modelling.

Do you know the sequence length in advance, or will we need a symbol (a "flag") for the end-of-sequence?

If you have a very large sample, or if you can generate one from your data, you can measure/approximate the probabilities of the "next" values, the out-of-order values and the end-of-sequence symbol. Once you know these probabilities, you will be able to compress the sequence very nicely with questions (like "is it the end?" or "is the next value larger than the current value/2?") and answers (like "no", "no") and an arithmetic encoder.
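The measuring step can be sketched as a simple counting estimator. A minimal Python sketch, assuming the samples are available as lists of integers (only the five posted series are used here, so the estimates are only illustrative):

```python
from collections import Counter, defaultdict

# The five posted series as toy sample data.
samples = [
    [191, 103, 53, 20, 16, 3, 4, 1, 1, 1],
    [229, 96, 48, 22, 7, 9, 2, 3, 1],
    [190, 96, 42, 29, 16, 6, 4, 2, 1, 0, 1],
    [221, 93, 43, 30, 10, 10, 2, 3],
    [192, 96, 50, 28, 15, 3, 5, 2, 0, 1],
]

# Count (previous value -> next value) transitions, plus end-of-sequence events.
END = "end"
transitions = defaultdict(Counter)
for seq in samples:
    prev = 450          # treat the stated upper bound as the "zeroth" value
    for v in seq:
        transitions[prev][v] += 1
        prev = v
    transitions[prev][END] += 1

def probability(prev, nxt):
    """Empirical P(next | prev); needs far more samples to be reliable."""
    total = sum(transitions[prev].values())
    return transitions[prev][nxt] / total if total else 0.0
```

With a genuinely large sample these relative frequencies converge toward the probabilities the arithmetic encoder needs.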

3. I thought that with sufficient samples, methods like 'least squares estimates' or similar give the most economical representations? Or is there some newer, improved method?

One can then just utilise the standard 'Wolfram' functions, etc.

(Yes, there is an upper bound on the first number in the series, and it's 450.)

4. For optimum compression, you'll have to know the probabilities.

If these probabilities vary from sample to sample, then yes, you will need an online estimator.
But if these probabilities are fixed and independent of the "current sample" (i.e. the probability distribution is true for all the samples), then you don't need an online estimator. It looks like this is the case. Is it?
In this case with sufficient number of samples, you can calculate (estimate) the probability distribution in advance and use these fixed probabilities when compressing one or more series.

To find the proper probability distribution, you'll only need to find out what the probability depends on. For example: does the probability of "the sequence ends" depend only on the last value, and not on previous values or the length of the sequence so far? If so, you'll need to calculate the conditional probability P(end | last value so far). If it depends on the value index as well (how long the current sample is so far), then it will be P(end | value, sequence length so far). Any such useful condition ("context") that makes the prediction better may be used, and when you have the conditional probabilities calculated from your samples, like P(end | last value so far = 3, sequence length so far = 7) = 97%, then you are done. This is everything you need.
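Such a context table can be built by plain counting. A sketch, assuming the samples are Python lists (far more data than the five posted series would be needed for stable estimates):

```python
from collections import Counter

def end_probabilities(samples):
    """Estimate P(end | value, sequence length so far) by simple counting."""
    seen = Counter()   # times each (value, length) context occurred
    ended = Counter()  # times the sequence ended in that context
    for seq in samples:
        for i, v in enumerate(seq):
            ctx = (v, i + 1)          # value and length so far
            seen[ctx] += 1
            if i + 1 == len(seq):     # the sequence ended right here
                ended[ctx] += 1
    return {ctx: ended[ctx] / seen[ctx] for ctx in seen}

# Two of the posted series as toy input.
samples = [
    [191, 103, 53, 20, 16, 3, 4, 1, 1, 1],
    [229, 96, 48, 22, 7, 9, 2, 3, 1],
]
p_end = end_probabilities(samples)
```

Any extra context (e.g. the value before the last one) just widens the key tuple; the trade-off is that sparser contexts need more samples.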

For calculating the proper conditional probabilities you'll need a large number of samples and your intuition about what the probabilities depend on. You can utilize different advanced methods, but you'll still need to provide such useful contexts to them (previous value, sequence length, and anything else you think would be useful) so they can properly estimate the conditional probability.

450 looks promising: when we set 450 as the zeroth number, the proportion of the first value is about 0.5 of 450. And it looks like this holds for the whole sequence: the proportion of the next value is around 0.5 of the previous value.

Since your samples look really nice, I would not go too far; I'd just use Excel and plot the conditional probability distributions of the next value conditioned on the current value (it may turn out to look like a simple known distribution). I have the feeling (a hypothesis) that the end-of-sequence flag can also be modeled properly by simply calculating the proportions: as the predicted value (a real value, not an integer) gets smaller, the probability of "end" gets higher.

5. I believe we might have a simple normal distribution here.
If that is true, to approximate the next value from the previous value:

To find the next value, use a normal distribution with mean = (current value)/2 and variance 4.0, except for the first value, where the variance is probably around 18-19.
Use an estimation formula for the cumulative distribution function (CDF) of the normal distribution, so that you can calculate the probability that the next sequence value lies within a specific range; by range splitting and an arithmetic encoder you can then encode the value.
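Such a CDF is available in the Python standard library via `math.erf`. A sketch; the mean 450/2 and the variance figure are the rough guesses from this thread, not fitted values:

```python
import math

def normal_cdf(x, mean, variance):
    """P(X <= x) for a normal distribution, via the standard error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * variance)))

def prob_in_range(lo, hi, mean, variance):
    """Modeled probability that an integer value falls in [lo, hi]."""
    return normal_cdf(hi + 0.5, mean, variance) - normal_cdf(lo - 0.5, mean, variance)

# First value of a sequence: mean 450/2 = 225, variance guessed at ~18.
p_lower_half = prob_in_range(0, 224, 225.0, 18.0)
```

The half-integer offsets treat each integer value as the interval it covers on the real line, which is what the arithmetic encoder needs.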

Range splitting goes like this for the first number of the first sequence (191). Base value (mean for the normal distribution): 450/2 = 225.
Is the next value in 0..224? Yes. Is the next value in 0..112? No. (So the next value is in 113..224.) Is the next value in 113..168? No. (So the next value is in 169..224.) And so on.

For the arithmetic encoder you will need to calculate the probability that your next value is in the first part of each range and not in the second. You will need the normal distribution CDF formula: what is the probability that a sample from a normal distribution with mean 225 and variance 18 lies in the range 0..224 rather than 225..450 (it's about 0.5)? Then 0..112 vs 113..224, and so on.
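The range-splitting walk and the CDF combine as follows: each yes/no answer costs about -log2 of its modeled probability, and the sum approximates the arithmetic-coded length. A sketch, with the mean and variance again taken as the rough guesses above (the clamp against log(0) in the far tails is my addition):

```python
import math

def normal_cdf(x, mean, variance):
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * variance)))

def encode_cost_bits(value, lo, hi, mean, variance):
    """Approximate bits an arithmetic coder spends binary-searching `value`
    into [lo, hi] when each yes/no answer is coded with the probability
    given by the normal model."""
    bits = 0.0
    while lo < hi:
        mid = (lo + hi) // 2
        total = normal_cdf(hi + 0.5, mean, variance) - normal_cdf(lo - 0.5, mean, variance)
        low = normal_cdf(mid + 0.5, mean, variance) - normal_cdf(lo - 0.5, mean, variance)
        # Clamp to avoid log(0) far out in the tails of the model.
        p_low = min(max(low / max(total, 1e-300), 1e-12), 1.0 - 1e-12)
        if value <= mid:
            bits += -math.log2(p_low)      # answer: "yes, lower half"
            hi = mid
        else:
            bits += -math.log2(1.0 - p_low)  # answer: "no, upper half"
            lo = mid + 1
    return bits

# Cost of coding 191 as a first value under the guessed model.
cost = encode_cost_bits(191, 0, 450, 225.0, 18.0)
```

Note that values near the mean come out much cheaper than values in the tails; if the guessed variance figures are off, the modeled costs will be off as well.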

The probability that the sequence stops cannot be estimated from these small samples.
It is also possible that the next value does not depend on the previous value, but only on the sequence index of the value. It's difficult to see that from this small sample.

6. >>I believe, we might have a simple normal distribution here.

Yes it is

>>If that is true, to approximate the next value from the previous value:

>>The probability that the sequence stops cannot be estimated from these small samples. It is also possible that the next value does not depend on the previous value, but only on the sequence index of the value. It's difficult to see that from this small sample.

It depends on both the sequence index and the previous value.
