1. Encoding question

Hi,

I'd like to ask a question that I tried to answer myself but couldn't come up with any solution.

I'd like to know of any algorithm with these properties (or at least a proof of whether one does or doesn't exist):

status_in --> [ ALGORITHM ] --> status_out

1) "status_out" is 1 bit larger or 1 bit smaller than the original "status_in", with a 50% chance of each
2) from "status_out" I can always go back to "status_in"

Sorry in advance if the question is not well formed and maybe lacks some important details, but those are basically the only two properties that I'm interested in and I'm not able to rephrase the question more precisely.

Thank you in advance for any help, and please let me know if there are any more details I could give to make my question clearer.

2. Easy: drop the high bit if it is 0, duplicate it otherwise.
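A minimal sketch of this scheme in Python (function names are mine). As discussed below in the thread, decoding is only unambiguous if the original length is known:

```python
def encode(bits):
    """Drop the high bit if it is 0, duplicate it if it is 1."""
    if bits[0] == '0':
        return bits[1:]           # one bit shorter
    return '1' + bits             # one bit longer

def decode(bits, orig_len):
    """Invert encode(), given the original length."""
    if len(bits) == orig_len - 1:
        return '0' + bits         # the dropped bit was a 0
    return bits[1:]               # remove the duplicated leading 1

# round-trip every 4-bit input
for n in range(16):
    s = format(n, '04b')
    assert decode(encode(s), 4) == s
```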

3. No, it is not possible to shorten half of all possible files by one bit. Bulat's proposed solution cannot distinguish between a dropped 0 and a duplicated 1 when the altered file starts with two or more 1s.

Assuming the data is random, the closest you can come to this is a scheme that compresses one less than half of the files by 1 bit and roughly doubles the size of the remaining files, or a scheme that compresses one third of the files by 1 bit and adds one bit to the other two thirds of the possible files.

4. Originally Posted by Kennon Conrad the closest you can come to this is to create a scheme that compresses one less than half the files by 1 bit and approximately doubles the size of the remaining files or a scheme that compresses one third of the files by 1 bit and adds one bit to the other two thirds of the possible files.
Could you please give a (practical) example of both cases?

5. It seems to me you have to impose the restriction that the last two bits of "status_in" are always zero,
and that the added bit, when there is one, is 1. So when the last bit of "status_out" is 1 it is removed;
when it is 0, it is followed by another 0.

6. You can prove that it's impossible (assuming you don't know the size of the sequence - if you do, Bulat's solution works).

Simple counting - there are 2^(N-2) + 2^N ways of getting an N-bit output: half the (N-1)-bit sequences grow by one bit, and half the (N+1)-bit sequences shrink by one bit. You cannot recover 2^(N-2) + 2^N possible combinations uniquely using N bits, so the transformation you propose is not reversible unless you have some extra information (you need about a third of a bit extra - log2(1.25) ≈ 0.32 to be precise - which probably explains the numbers behind Kennon Conrad's second scheme).
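The count is easy to check numerically; a quick sketch for N = 8:

```python
import math

N = 8
growers   = 2**(N - 1) // 2   # half of all (N-1)-bit inputs grow to N bits
shrinkers = 2**(N + 1) // 2   # half of all (N+1)-bit inputs shrink to N bits
needed    = growers + shrinkers
available = 2**N
print(needed, available)              # 320 distinct outputs needed, 256 available
print(math.log2(needed / available))  # ~0.3219 extra bits, i.e. log2(1.25)
```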

7. If you know the size of the sequence then the compressor could drop all leading zeros and you wouldn't need to add an extra 1 to files that start with 1's.

I was wrong about the scheme that compresses one less than half the combinations by 1 bit while only doubling the length of the other combinations. The size needed to express the other combinations is generally much more than double. An example scheme would be the following:

If the input starts with 0 and has at least one 1, drop the first 0.
Otherwise, if the input is all 0's, write (L - 1) + 2^(L-1) 0's, where L is the input size in bits.
Otherwise, write L + N 0's, where L is the input size in bits and N is the input value.

and the size growth for the files that don't shrink by a bit is roughly 2^L, where L is the input size in bits.
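A sketch of this scheme in Python (my own reading of the steps above, taking the all-zeros case to write (L - 1) + 2^(L-1) zeros so the runs of zeros never collide):

```python
def encode(s):
    L = len(s)
    if s[0] == '0' and '1' in s:
        return s[1:]                       # drop the first 0
    if '1' not in s:
        return '0' * (L - 1 + 2**(L - 1))  # all-zeros input
    return '0' * (L + int(s, 2))           # input starts with 1; N = its value

def decode(s):
    if '1' in s:
        return '0' + s                     # restore the dropped 0
    M, L = len(s), 1
    while L + 2**L - 1 < M:                # find the original length L for this run
        L += 1
    if M == L - 1 + 2**(L - 1):
        return '0' * L                     # the all-zeros input
    return format(M - L, f'0{L}b')         # otherwise N = M - L

# every 4-bit input round-trips; 7 of 16 (one less than half) shrink by a bit
for n in range(16):
    s = format(n, '04b')
    assert decode(encode(s)) == s
```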

The second case is much easier. For all input values < 010101010101...., the leading zero is dropped. Otherwise an extra 1 is added to the start. To decode, for values < 1010101010... add a leading zero; otherwise drop the leading one. One third of the values (rounded down) will be < 01010101...., so 1/3 of files will become 1 bit smaller and 2/3 of files will become 1 bit larger. This is the limit because 3 * 0.5^N = 1 * 0.5^(N-1) + 2 * 0.5^(N+1) (= 1 * 0.5^-1 * 0.5^N + 2 * 0.5 * 0.5^N = 2.0 * 0.5^N + 1.0 * 0.5^N), i.e. for every code that is reduced by one bit you need two codes to be increased by one bit to equalize the total probability.
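The second scheme is short enough to sketch directly (my own implementation of the rule above, using lexicographic comparison of equal-length bit strings):

```python
def pattern(length, first):
    """Alternating bit string, e.g. pattern(5, 0) = '01010'."""
    return ''.join(str((i + first) % 2) for i in range(length))

def encode(s):
    if s < pattern(len(s), 0):    # below 0101...: drop the leading 0
        return s[1:]
    return '1' + s                # otherwise prepend a 1

def decode(s):
    if s < pattern(len(s), 1):    # below 1010...: restore the leading 0
        return '0' + s
    return s[1:]                  # otherwise drop the leading 1

# every 5-bit input round-trips; 10 of 32 (one third, rounded down) shrink
shrunk = 0
for n in range(32):
    s = format(n, '05b')
    assert decode(encode(s)) == s
    shrunk += len(encode(s)) < 5
print(shrunk)  # 10
```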

A much easier way to think about this for me is to think of a painter painting boards. Suppose he has two boards and enough paint to paint two boards. If he decides to paint one board twice, he is out of paint and can't paint the other board. If he has 2.01 boards and enough paint for 2.01 boards, he could paint one board twice and sprinkle a tiny bit of paint on the remaining 1.01 boards. If he has three boards and enough paint for three boards, he can paint one board twice and put a half coat on the other two boards before running out of paint. Unless he knows some boards should be painted better than others, he's probably best off applying the paint equally.

8. Originally Posted by Kennon Conrad A much easier way to think about this for me is to think of a painter painting boards. Suppose he has two boards and enough paint to paint two boards. If he decides to paint one board twice, he is out of paint and can't paint the other board. If he has 2.01 boards and enough paint for 2.01 boards, he could paint one board twice and sprinkle a tiny bit of paint on the remaining 1.01 boards. If he has three boards and enough paint for three boards, he can paint one board twice and put a half coat on the other two boards before running out of paint. Unless he knows some boards should be painted better than others, he's probably best off applying the paint equally.
Thank you Conrad for your example with boards; I think it's very nice. Your example was with

2 boards, enough paint for 2 boards
2 boards, enough paint for 2 boards + a tiny bit
3 boards, enough paint for 3 boards

so by extension I was thinking of

2 boards, enough paint for 3 boards
2 boards, enough paint for 4 boards

and this made me wonder if the problem could be solved with different requirements. For example, instead of my original question "1/2 lose 1 bit, 1/2 gain 1 bit", something else like "1/3 lose 1 bit, 1/3 gain 1 bit, 1/3 invariant (nothing changed)", or maybe the problem could be solved with a higher number of gained/lost bits, for example "1/2 lose 2 bits, 1/2 gain 2 bits". Also, I'm thinking in "boards" right now and I'm having some difficulty converting this different idea into practice; so if you think this is at all relevant, I'd also appreciate any practical example with bits and not just boards =)

9. I would use Huffman construction to solve this.

The simple idea is as follows:
- a symbol with the "normal" number of bits is worth 2 points
- a symbol using one less bit is worth 4 points
- a symbol using one more bit is worth 1 point
- the total number of points never changes

So, assuming you have an alphabet of 256 symbols (a full byte), you have exactly 512 points to distribute.

A basic building block is:
- 4 normal symbols = 8 points
- is equivalent to: 1 "large" (4 points), 1 "normal" (2 points), 2 "small" (2x1 point) => still 4 symbols and 8 points

And now you can apply this transformation any way you wish.
For example, for 256 symbols, a simple pass gives 64 symbols with 7 bits, 64 symbols with 8 bits, 128 symbols with 9 bits.
Re-applying the formula recursively, it's possible to reach 85 symbols with 7 bits, 1 symbol with 8 bits, 170 symbols with 9 bits.

It's also readily apparent that "1/2 with one more bit + 1/2 with one less bit" doesn't work, since it would require: 128x4 + 128x1 = 640 points > 512 points.
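The point arithmetic is the Kraft inequality scaled by 512 (one point = 2^-9, so a 7-bit code is 4 points, an 8-bit code 2 points, a 9-bit code 1 point). A quick check of the code-length sets mentioned above (helper name is mine):

```python
def kraft(assignments):
    """Kraft sum for a list of (symbol_count, code_length) pairs."""
    return sum(count * 2.0**-length for count, length in assignments)

print(kraft([(64, 7), (64, 8), (128, 9)]))  # 1.0 -> a valid prefix code exists
print(kraft([(85, 7), (1, 8), (170, 9)]))   # 1.0
print(kraft([(128, 7), (128, 9)]))          # 1.25 > 1 -> impossible (640 points)
```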

10. @Cyan would some other arrangement like "1/3 with one more bit + 1/3 with one less bit + 1/3 with no change" be viable instead? Or any other "equal" proportion (unlike 1/3+2/3)?

11. If all patterns have equal probability but you assign double the "points" relative to what a pattern deserves, then you must assign half the "points" to two other patterns.

So in Cyan's case with an alphabet of 256 symbols, these are the options:
0 @ 7 bits, 256 @ 8 bits, 0 @ 9 bits (optimal)
1 @ 7 bits, 253 @ 8 bits, 2 @ 9 bits
2 @ 7 bits, 250 @ 8 bits, 4 @ 9 bits
3 @ 7 bits, 247 @ 8 bits, 6 @ 9 bits
....
85 @ 7 bits, 1 @ 8 bits, 170 @ 9 bits

The only way to obtain an equal number of 7 bit and 9 bit codes is to not change the code lengths.

12. A somewhat different question, although related to this thread.

If I have two symbols with equal probability, I encode them as (A=0, B=1). So, the string "AB" would be encoded as AB = 01 = 2 bits

Is it possible to change the symbols in such a way that

1- one symbol is encoded with less than 1 bit
2- the other symbol is encoded with more than 1 bit
3- the string AB is still encoded with 2 bits, AB = .... = 2 bits (perhaps 2 bits "on average"?)

for example, would something like this be possible?

(A=0.8, B=1.2), AB = (0.8+1.2) = 2

I hope my question makes sense

13. It depends on how one interprets the question. It is possible to have probabilities of (for example) 0.625 for "A", 0.125 for "B" and 0.25 for "AB". If this were done, "A" could be encoded with 0.68 bits, "B" could be encoded with 3.00 bits and "AB" could be encoded with 2.00 bits, so this meets conditions #1 - #3.

However, this is not consistent with your example, where you imply the cost of encoding AB should be the sum of the cost of encoding A and the cost of encoding B. If that is the case and you have probability p for symbol A, then B has probability (1 - p) and the cost of encoding AB is log2(1/p) + log2(1/(1-p)). Plotting the cost of encoding AB as a function of the cost of encoding A for probabilities 1% - 99% shows that the only point where the cost of sending AB is 2.0 bits is where the cost of sending A is 1.0 bits. Matt Mahoney has a few sentences on the subject in his Data Compression Explained ebook, in the first paragraph here: http://mattmahoney.net/dc/dce.html#Section_32
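Both calculations are easy to reproduce; a quick sketch (the `bits` helper is mine):

```python
import math

def bits(p):
    """Ideal code length, in bits, for a symbol of probability p."""
    return -math.log2(p)

# first interpretation: a three-symbol alphabet
for sym, p in (('A', 0.625), ('B', 0.125), ('AB', 0.25)):
    print(sym, round(bits(p), 3))              # 0.678, 3.0, 2.0

# second interpretation: cost of A plus cost of B as p varies
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(p, round(bits(p) + bits(1 - p), 3))  # minimum of 2.0 at p = 0.5
```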

14. Arithmetic coding (aka range coding) and ANS will let you encode symbols at fractional bit costs, though they all have a very slight overhead due to periodic normalisation and fitting things into integers. That said, it's only with equal probabilities of A and B that the average cost per symbol is exactly 1 bit. In other cases the average will be less than one bit.

If in doubt, try it! Write a noddy one-liner in your favourite language that spits out "A" 40% of the time and "B" the other 60% to a file. Create, say, 1MB of that and chuck it through various entropy encoders to see how close they get to the Shannon limit. It ought to be 400000*1.322 + 600000*0.737 bits ≈ 970951 bits - a shade under 1 bit per symbol. As the probabilities get more extreme, e.g. 10% and 90%, the total data entropy gets smaller (approx 469 Kbits for the 10/90% case).
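For example, the 40/60 experiment can be simulated directly (a sketch; the exact total depends on the random sample drawn):

```python
import math, random

random.seed(1)
n = 1_000_000
data = ['A' if random.random() < 0.4 else 'B' for _ in range(n)]

# empirical Shannon bound: sum over symbols of count * -log2(frequency)
count_a = data.count('A')
count_b = n - count_a
total_bits = (count_a * -math.log2(count_a / n)
              + count_b * -math.log2(count_b / n))
print(round(total_bits))  # close to the 970951 bits computed above
```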

Edit: sorry, I just read the "with equal probability" statement. In that case you'll always get the same size for each, barring some trivial rounding errors (which will likely not be in your favour).
