
Thread: Encoding question

  1. #1 - cassini (Member, China)

    Encoding question

    Hi,

    I'd like to ask a question that I tried to answer myself, but I couldn't come up with any solution.

    I'd like to know of any algorithm (or, at the least, whether it is possible to prove that one does or doesn't exist) with these properties:

    status_in --> [ ALGORITHM ] --> status_out

    1) "status_out" is 1 bit larger or 1 bit smaller than the original "status_in" with a random 50% chance
    2) from "status_out" I can always go back to "status_in"

    Sorry in advance if the question is not well formed and lacks some important details; those are basically the only two properties I'm interested in, and I'm not able to phrase the question more precisely.

    Thank you in advance for any help, and please let me know if there are any more details I could give to make my question clearer.
    Last edited by cassini; 16th September 2016 at 03:26. Reason: formatting

  2. #2 - Bulat Ziganshin (Programmer, Uzbekistan)
    Easy: drop the high bit if it is 0, duplicate it otherwise.

  3. #3 - Kennon Conrad (Member, Bothell, Washington, USA)
    No, it is not possible to shorten half of all possible files by one bit. Bulat's proposed solution cannot distinguish between a dropped 0 and a duplicated 1 when the altered file starts with two or more 1s.
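
    To make the ambiguity concrete, here is a minimal sketch (my own illustration, not from the post) of Bulat's rule in Python, showing two different inputs that collide on the same output:

    Code:
    # Bulat's rule: drop the leading bit if it is 0, duplicate it if it is 1.
    def transform(bits):
        return bits[1:] if bits[0] == '0' else bits[0] + bits

    print(transform('011'))  # '11' (leading 0 dropped)
    print(transform('1'))    # '11' (leading 1 duplicated)
    # Both inputs map to '11', so the output alone cannot be decoded
    # without knowing the original length.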

    Assuming the data is random, the closest you can come to this is to create a scheme that compresses one less than half of the files by 1 bit and approximately doubles the size of the remaining files, or a scheme that compresses one third of the files by 1 bit and adds one bit to the other two thirds of the possible files.
    Last edited by Kennon Conrad; 16th September 2016 at 08:59.

  4. #4 - cassini (Member, China)
    Quote Originally Posted by Kennon Conrad:
    the closest you can come to this is to create a scheme that compresses one less than half the files by 1 bit and approximately doubles the size of the remaining files or a scheme that compresses one third of the files by 1 bit and adds one bit to the other two thirds of the possible files.
    Could you please give a (practical) example of both cases?

  5. #5 - Member (Italy)
    It seems to me you have to impose the restriction that the last two bits of the status_in word always be zero, and that the added bit, if any, be 1. Then when the last bit of status_out is 1 it is removed, and when it is 0 it is followed by another 0.
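
    A minimal sketch of this restricted scheme (the function names are made up for illustration), assuming every status_in is forced to end in "00":

    Code:
    def encode(bits, grow):
        assert bits.endswith('00')                # the imposed restriction
        return bits + '1' if grow else bits[:-1]  # grow: append a 1; shrink: drop a 0

    def decode(bits):
        # A trailing 1 must be the added bit; a trailing 0 means a 0 was dropped.
        return bits[:-1] if bits.endswith('1') else bits + '0'

    for s in ('1100', '0100'):
        for grow in (True, False):
            assert decode(encode(s, grow)) == s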

  6. #6 - Member (USA)
    You can prove that it's impossible (assuming you don't know the size of the sequence; if you do, Bulat's solution works).

    Simple counting: there are 2^(N-2) + 2^N ways of getting an N-bit output, namely half of the (N-1)-bit sequences, which grow by one bit, plus half of the (N+1)-bit sequences, which shrink by one bit. You cannot recover 2^(N-2) + 2^N possible combinations uniquely using N bits, so the transformation you propose is not reversible unless you have some extra information (you need about a third of a bit extra, log2(1.25) to be precise, which probably explains the numbers behind Kennon Conrad's second scheme).
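
    The counting can be checked numerically (a quick sketch, not from the post):

    Code:
    from math import log2

    N = 10
    sources = 2 ** (N - 2) + 2 ** N  # half of the (N-1)-bit inputs plus half of the (N+1)-bit inputs
    targets = 2 ** N                 # distinct N-bit outputs available
    print(sources, targets)          # 1280 vs 1024: more sources than targets
    print(log2(sources / targets))   # 0.3219... = log2(1.25) extra bits needed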

  7. #7 - Kennon Conrad (Member, Bothell, Washington, USA)
    If you know the size of the sequence, then the compressor could drop all leading zeros, and you wouldn't need to add an extra 1 to files that start with 1's.

    I was wrong about the scheme that compresses one less than half of the combinations by 1 bit merely doubling the length of the other combinations. The size needed to express the other combinations is generally much more than double. An example scheme would be the following:

    If the input starts with 0 and has at least one 1, drop the first 0.
    Otherwise, if the input is all 0's, write (L-1) + 2^(L-1) 0's, where L is the input size in bits.
    Otherwise, write L + N 0's, where L is the input size in bits and N is the input value.

    and the converted file size growth for those that don't shrink by a bit is approximately 2 to the power of the input size in bits.

    The second case is much easier. For all input values < 010101010101...., the leading zero would be dropped. Otherwise an extra 1 is added to the start. To decode, for values < 1010101010.... add a leading zero, otherwise drop the leading one. One third of the values (rounded down) will be < 01010101...., so 1/3 of files will become 1 bit smaller and 2/3 of files will become 1 bit larger. This is the limit because 3 * 0.5^N = 1 * 0.5^(N-1) + 2 * 0.5^(N+1) (= 1 * 0.5^(-1) * 0.5^N + 2 * 0.5 * 0.5^N = 2.0 * 0.5^N + 1.0 * 0.5^N), i.e. for every code that is reduced by one bit you need two codes to be increased by one bit to equalize the total probability.
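
    Here is a minimal sketch (my own illustration, not Kennon's code) of this second scheme for fixed-length L-bit inputs; as noted above, the decoder needs to know the original length:

    Code:
    def alternating(first, length):
        # '0101...' or '1010...' of the given length
        other = '1' if first == '0' else '0'
        return ''.join(first if i % 2 == 0 else other for i in range(length))

    def encode(bits):
        # Values below 0101... lose their leading zero; the rest gain a leading 1.
        return bits[1:] if bits < alternating('0', len(bits)) else '1' + bits

    def decode(bits):
        # Values below 1010... regain the leading zero; the rest lose the leading 1.
        return '0' + bits if bits < alternating('1', len(bits)) else bits[1:]

    L = 4
    shrunk = 0
    for v in range(2 ** L):
        s = format(v, '0%db' % L)
        assert decode(encode(s)) == s
        shrunk += len(encode(s)) < L
    print(shrunk, 'of', 2 ** L)  # 5 of 16: one third (rounded down) shrink by a bit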

    A much easier way to think about this for me is to think of a painter painting boards. Suppose he has two boards and enough paint to paint two boards. If he decides to paint one board twice, he is out of paint and can't paint the other board. If he has 2.01 boards and enough paint for 2.01 boards, he could paint one board twice and sprinkle a tiny bit of paint on the remaining 1.01 boards. If he has three boards and enough paint for three boards, he can paint one board twice and put a half coat on the other two boards before running out of paint. Unless he knows some boards should be painted better than others, he's probably best off applying the paint equally.

  8. #8 - cassini (Member, China)
    Quote Originally Posted by Kennon Conrad:
    A much easier way to think about this for me is to think of a painter painting boards. Suppose he has two boards and enough paint to paint two boards. If he decides to paint one board twice, he is out of paint and can't paint the other board. If he has 2.01 boards and enough paint for 2.01 boards, he could paint one board twice and sprinkle a tiny bit of paint on the remaining 1.01 boards. If he has three boards and enough paint for three boards, he can paint one board twice and put a half coat on the other two boards before running out of paint. Unless he knows some boards should be painted better than others, he's probably best off applying the paint equally.
    Thank you, Conrad, for your example with the boards; I think it's very nice. Your example was with

    2 boards, enough paint for 2 boards
    2 boards, enough paint for 2 boards + a tiny bit
    3 boards, enough paint for 3 boards

    so by extension I was thinking of

    2 boards, enough paint for 3 boards
    2 boards, enough paint for 4 boards

    and this made me wonder whether the problem could be solved with different requirements. For example, instead of my original question "1/2 lose 1 bit, 1/2 gain 1 bit", something else like "1/3 lose 1 bit, 1/3 gain 1 bit, 1/3 invariant (nothing changed)", or maybe with a higher number of gained/lost bits, for example "1/2 lose 2 bits, 1/2 gain 2 bits". Also, I'm thinking in "boards" right now and having some difficulty converting this different idea into practice, so if you think it is at all relevant, I'd also appreciate a practical example with bits and not just boards =)

  9. #9 - Cyan (Member, France)
    I would use Huffman construction to solve this.

    The simple idea is as follows:
    - a symbol with the "normal" number of bits is worth 2 points
    - a symbol using one less bit is worth 4 points
    - a symbol using one more bit is worth 1 point
    - the total number of points never changes

    So, assuming you have an alphabet of 256 symbols (a full byte), you have exactly 512 points to distribute.

    A basic building block is:
    - 4 normal symbols = 8 points
    - is equivalent to: 1 "large" (4 points), 1 "normal" (2 points), 2 "small" (2 x 1 point) => still 4 symbols and 8 points

    And now you can apply this transformation any way you wish.
    For example, for 256 symbols, a simple pass gives 64 symbols with 7 bits, 64 symbols with 8 bits, 128 symbols with 9 bits.
    Re-applying the formula recursively, it's possible to reach 85 symbols with 7 bits, 1 symbol with 8 bits, 170 symbols with 9 bits.

    It's also readily apparent that "1/2 with one more bit + 1/2 with one less bit" doesn't make it, since it would require : 128x4 + 128x1 = 640 points > 512 points
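
    The bookkeeping is easy to verify (a quick sketch, not Cyan's code): a 7-bit code is worth 4 points, an 8-bit code 2 points, a 9-bit code 1 point, and a valid set of 256 codes must use exactly 512 points (equivalently, its Kraft sum must equal 1):

    Code:
    from fractions import Fraction

    def points(n7, n8, n9):
        return 4 * n7 + 2 * n8 + 1 * n9

    def kraft(n7, n8, n9):
        return n7 * Fraction(1, 2**7) + n8 * Fraction(1, 2**8) + n9 * Fraction(1, 2**9)

    for split in [(64, 64, 128), (85, 1, 170)]:      # the two examples above
        print(split, points(*split), kraft(*split))  # 512 points, Kraft sum 1

    print(points(128, 0, 128))  # the proposed half/half split: 640 > 512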

  10. #10 - cassini (Member, China)
    @Cyan: would some other arrangement like "1/3 with one more bit + 1/3 with one less bit + 1/3 with no change" be viable instead? Or any other "equal" proportion (unlike 1/3 + 2/3)?
    Last edited by cassini; 17th September 2016 at 14:05. Reason: more info

  11. #11 - Kennon Conrad (Member, Bothell, Washington, USA)
    If all patterns have equal probability but you assign double the "points" relative to what a pattern deserves, then you must assign half the "points" to two other patterns.

    So in Cyan's case with an alphabet of 256 symbols, these are the options:
    0 @ 7 bits, 256 @ 8 bits, 0 @ 9 bits (optimal)
    1 @ 7 bits, 253 @ 8 bits, 2 @ 9 bits
    2 @ 7 bits, 250 @ 8 bits, 4 @ 9 bits
    3 @ 7 bits, 247 @ 8 bits, 6 @ 9 bits
    ....
    85 @ 7 bits, 1 @ 8 bits, 170 @ 9 bits

    The only way to obtain an equal number of 7-bit and 9-bit codes is to not change the code lengths.
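
    A brute-force sketch (my own illustration) that reproduces this list by checking the 512-point budget for every possible split:

    Code:
    options = []
    for n7 in range(257):
        for n9 in range(257):
            n8 = 256 - n7 - n9
            if n8 >= 0 and 4 * n7 + 2 * n8 + n9 == 512:  # the Kraft sum scaled by 512
                options.append((n7, n8, n9))

    print(options[0], options[-1], len(options))
    # (0, 256, 0) (85, 1, 170) 86: every option has the form (k, 256 - 3k, 2k)
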
    Last edited by Kennon Conrad; 18th September 2016 at 04:41.

  12. #12 - cassini (Member, China)
    A somewhat different question, although related to this thread.

    If I have two symbols with equal probability, I encode them as (A=0, B=1). So the string "AB" would be encoded as AB = 01 = 2 bits.

    Is it possible to change the symbols in such a way that

    1- one symbol is encoded with less than 1 bit
    2- the other symbol is encoded with more than 1 bit
    3- the string AB is still encoded with 2 bits, AB = .... = 2 bits (perhaps 2 bits "on average"?)

    for example, would something like this be possible?

    (A=0.8, B=1.2), AB = (0.8+1.2) = 2


    I hope my question makes sense

  13. #13 - Kennon Conrad (Member, Bothell, Washington, USA)
    It depends on how one interprets the question. It is possible to have probabilities of (for example) 0.625 for "A", 0.125 for "B" and 0.25 for "AB". If this were done, "A" could be encoded with 0.67 bits, "B" could be encoded with 3.00 bits and "AB" could be encoded with 2.00 bits, so this meets conditions #1 - #3.

    However, this is not consistent with your example where you imply the cost of encoding AB should be the sum of the cost of encoding A and the cost of encoding B. If this is the case and you have probability p for symbol A, then B will have probability of (1 - p) and the cost of encoding AB is log2(1/p) + log2(1/(1-p)). Here is a plot of the cost of encoding AB as a function of the cost of encoding A for probabilities 1% - 99%:

    [Plot: AB_cost.jpg, the cost of encoding AB vs. the cost of encoding A]

    The only point on the plot where the cost of sending AB is 2.0 bits occurs where the cost of sending A is 1.0 bits. Matt Mahoney has a few sentences on the subject in his Data Compression Explained ebook in the first paragraph here: http://mattmahoney.net/dc/dce.html#Section_32
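
    The claim is easy to reproduce numerically (a small sketch, not from the post): the cost log2(1/p) + log2(1/(1-p)) reaches its minimum of exactly 2.0 bits at p = 0.5:

    Code:
    from math import log2

    def cost_AB(p):
        return log2(1 / p) + log2(1 / (1 - p))

    for p in (0.1, 0.3, 0.5, 0.625, 0.9):
        print(p, round(log2(1 / p), 2), round(cost_AB(p), 2))
    # Only p = 0.5 gives an AB cost of 2.0 bits; every other p costs more.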

  14. #14 - Member (Cambridge, UK)
    Arithmetic coding (aka range coding) and ANS will permit you to encode symbols with a fractional number of bits, but they all have a very slight cost due to periodic normalisation and fitting things into integers. That said, it's only in the case of equal probabilities of A and B that the costs average out to 1 bit per symbol; in other cases the average will be less than one bit.

    If in doubt, try it! Write a noddy one-liner in your favourite language that spits out "A" 40% of the time and "B" the other 60% to a file. Create, say, 1MB of that and chuck it through various entropy encoders to see if it meets the known Shannon limit. It ought to be 400000*1.322 + 600000*0.737 bits == 970951 bits, a shade under 1 bit per symbol. As the probabilities get more extreme, e.g. 10% and 90%, the total data entropy should get smaller (approx 469 Kbits for that 10/90 case).
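
    For instance, a noddy sketch in Python (the file name and seed are arbitrary) that generates the 40/60 test data and prints the Shannon limit to compare against:

    Code:
    import random
    from math import log2

    random.seed(1)
    syms = random.choices('AB', weights=[0.4, 0.6], k=1000000)
    with open('ab.dat', 'w') as f:
        f.write(''.join(syms))  # feed this file to your favourite entropy coders

    na = syms.count('A')
    nb = len(syms) - na
    print(round(na * log2(1 / 0.4) + nb * log2(1 / 0.6)))  # about 970951 bits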

    Edit: sorry, I just read the "with equal probability" statement. In that case you'll always get the same size for each, barring some trivial rounding errors (which will likely not be in your favour).
