Results 1 to 30 of 30

Thread: Concatenating strings

  1. #1
    Member
    Join Date
    Sep 2014
    Location
    Europe
    Posts
    2
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Question Concatenating strings

    Hello everyone!

    I'd like to share a post that I've recently submitted to SO. I would really appreciate any idea or hint to solve this problem.

    https://stackoverflow.com/questions/...trieve-strings

    Another SO user, Oli, has posted a very useful comment that however seems to work only for longer strings (10s of symbols or more). I was wondering whether there is a solution for shorter strings, for example if the average length is 5 symbols.

    Thank you in advance for any comment.

  2. #2
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,497
    Thanks
    26
    Thanked 132 Times in 102 Posts
    I don't know if your problem is solvable but let's try.

    First problem: are zero lengths strings allowed?

    Regardless, you have to design a prefix code. You've said your codes can't expand the input by more than two bits. So each string of length N symbols can be encoded to output of maximum 2 * (N + 1) bits. Since there's no gain from assigning codes shorter than maximum allowed in regard to correctness of solution, we should just assign maximum code length to every input string and then look if we can build prefix code.

    So let's assign codes, assuming zero-length strings are possible:
    Code:
    - -> 00 (zero length input string)
    0 -> 0100
    ... (there are 4 1-letter strings: from 0 to 3)
    3 -> 0111
    00 -> 100000
    ... (there are 16 2-letter strings: from 00 to 33)
    33 -> 101111
    000 -> 11000000
    ... (there are 64 3-letter strings: from 000 to 333)
    333 -> 11111111
    We're out of codespace now. We can't encode strings longer than 3 symbols if we allow zero-length strings and fill codespace sequentially.

    If we disallow zero-length string then situation will be different. A program would be useful there, but I'm not so inclined :]

  3. #3
    Member
    Join Date
    Sep 2014
    Location
    Europe
    Posts
    2
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    First problem: are zero lengths strings allowed?
    Good point. No, in my particular case I don't consider null strings. Only strings of length 1 or more are relevant.

  4. #4
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,497
    Thanks
    26
    Thanked 132 Times in 102 Posts
    Either way, because 1 to 3 symbol length input strings can't be compressed you need to use those 2 extra bits to tell the length of input string.
    00 - for 1 symbol long input strings,
    01 - for 2 ..
    10 - for 3 ..
    11 - for 4 symbol long or longer input strings.

    So the question can now be rephrased (remember: assuming empty input strings are disallowed): can we build prefix code that don't expand data (as we already used those 2 extra bits) and unambiguously encodes every 4 symbol long or longer input string and how?

    My guess is: we can't.

  5. #5
    Member
    Join Date
    Sep 2014
    Location
    Ukraine
    Posts
    1
    Thanks
    0
    Thanked 0 Times in 0 Posts
    you basically want to encode 5 letter alphabet message using at most 2 bits per symbol. even with additional restrictions that you have provided, i think it is impossible (i can't prove it though, just my intuition).

    however, if you impose ordering on your strings, you can get very close to 2 bits per symbol.

    1) encode all strings containing only up to 2 different symbols, in lexicographic order. use 1st unused symbol as end of string marker, 2nd unused symbol as end of stream of "strings that can be delimited by 1st unused symbol". also, there's no need to output the last "end of string" before "end of stream".

    2) encode the rest of strings, again, in lexicographic order, using the unused 4th symbol as end of string marker. however, unlike the first part, you can't use another symbol as end of stream of "strings delimited by 4th symbol". an obvious (though maybe not the best) solution is to use 4th symbol two times in a row as end of stream marker (essentially encoding empty string to mark end of stream). sadly, each such end of stream marker is 2 bits over your limit. but since there's maximum of 4 of such streams, that's only 8 extra bits regardless of number of strings.

  6. #6
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    506
    Thanks
    187
    Thanked 177 Times in 120 Posts
    Any string S of random values (0, 1, 2, 3) can itself be broken down into sub-strings containing no more than 3 out of the 4 symbols in it, ie S = (S1, S2, S3, .. Sn). Yet you are asking for "at most" 2 bits per character in the sub-string and hence by definition at most 2 bits per character in S, which I'll remind you is totally random from 4 symbols.

    Basically it's impossible unless the "at most" is "equal to" 2 bits. Then it's trivial unless we want the string length to be an encoded value in there somewhere too (in which case it's impossible too).

  7. #7
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,497
    Thanks
    26
    Thanked 132 Times in 102 Posts
    OK. I've actually written the code that proves it's impossible to construct prefix code satisfying your restrictions.

    Here it is: http://ideone.com/2sfDag
    It writes "error" at the moment of codespace exhaustion.

  8. #8
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    One thing I notice is that this could be expressed differently, and a better and more succinct description might promote clarity of thought and also make the constraints more obvious. As I understand, you don't so much have a set as a sequence, which means elements may repeat and order is significant. You didn't say retrieval must be fast, so if I understand you, you could imagine the strings already concatenated and separated with end symbols to make a single large string. The problem is then simply to compress this string. Is that accurate?

    Piotr's point about prefix codes seems on the right track. But there are infinitely many strings, so a more pertinent comparison might be to universal codes like Elias delta or Fibonacci. My gut feeling is that the constraints are not enough to guarantee the number of bits you are looking for.

  9. #9
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,497
    Thanks
    26
    Thanked 132 Times in 102 Posts
    My quick experiments suggest that with extra 3 bits (instead of extra 2 bits) encoding is possible.

    (In my code) simply increase starting value of codeLen from 4 to 5, so that the difference between inputLen and codeLen is equal to 3 (instead of 2) and the main loop magically runs for lots of iterations.

    Further experiments show that (alas) if you allow empty input strings then even 3 extra bits aren't enough. And if we go to 4 bits then we can use the simple solution you've described on SO (ie for every substring, prepend and append a symbol not appearing in the substring to that substring - that is trivially correct).
    Last edited by Piotr Tarsa; 7th September 2014 at 22:59.

  10. #10
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    if it can be done then it can be done with a special case of bijective arithmetic compressor/
    looking at worse cases only for a single string
    -log(1/4) 2 bits one token no eos token since only if follow up string is eos needed

    2 tokens
    -log(1/4 * (1/2)(1/3) )/log(2) = 4.5849625... since allowing already seen to be 2 bits same for EOS if followup string needed
    out of 6 bits allowed

    3 tokens
    -log(1/4* (1/2)*(1/3)* (1/4)*(1/2))/log(2) = 7.5849625... in this case 2 symbols known and EOS
    out of 8 bits allowed no each token add 2 bits and excess stays the same.

    after this worst case for a single string have same excesses since each symol used and eos have weight of two

    worse case only for two strings look at 3 tokens for each sting each sting of 3
    made of different characters.

    the fist string compress to 7.5849625... eos to 2 bits the secong string to 7.5849625...
    for a length of 17.169925 where 16 allowed

    You can not compress so that the max bits gained is no more than two per sting. If each subset made of there symbols not known in advance
    but given stings of 1 to 3 symbols only out of a set of 4 you can write a program to solve you problem.

    However for long strings and a different bijective compressor you can make the stings much smaller and then if each string of a certain length you could easily solve the problem.


    ERROR IN MY POST I feel bad and good at the same time. I went to sleep happy I did this. Then the nightmare in sleep realized that I made a mistake for worst case for several strings of 2 token but together. It was assuming that the EOS character when written had to be 2 bits long for string longer than one token wrong. I can write string worst case substrings of 2 tokens together withe EOS log 3 / log 2 bits = 1.58496250072 which means all substrings made of one or 2 tokens will expand the 2 bits max so I will rewrite this give me a day I woke to early to finish the 3 token strings. So unless you beat me to it. So it might be possible to do it bijectively and the 2 bit expansion limit might work but not sure.
    Last edited by biject.bwts; 8th September 2014 at 14:44. Reason: error in first post

  11. #11
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    OK. I've actually written the code that proves it's impossible to construct prefix code satisfying your restrictions.

    Here it is: http://ideone.com/2sfDag
    It writes "error" at the moment of codespace exhaustion.
    I didn't notice that you were claiming a proof. A proof ends the discussion, naturally. You might take a moment to justify your method, though, for the benefit of the silent readership as well as future generations. Code doesn't really speak for itself.

  12. Thanks:

    biject.bwts (8th September 2014)

  13. #12
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,497
    Thanks
    26
    Thanked 132 Times in 102 Posts
    The code just automates what I was doing manually in the first post in this thread and accepts different starting conditions. I didn't do any serious math.

  14. #13
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    506
    Thanks
    187
    Thanked 177 Times in 120 Posts
    While not a mathematical proof, it is obviously impossible.

    Let S be a string made up of random symbols from an alphabet of size 4. This clearly takes log2(4) bits (==2) to encode per symbol, plus a string length. (Or if we're using end-of-string as an alphabet character then it takes >2 bits per symbol).

    Set S' being the set of sub-strings of S such that within S' no more than 3 out of 4 alphabet symbols are present. The S' is ordered such that concatenatation of all elements of S' form the original S again.

    If it was possible to compress S' in anything smaller than S then we'd have invented recursive compression (we'd take the compression of S', convert it into base 2 and regenerate a new string S and repeat). We know this to be a lie, so the whole premise of S' being < 2 bits is also in error.

  15. #14
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,497
    Thanks
    26
    Thanked 132 Times in 102 Posts
    JamesB:
    If I understand correctly, the problem isn't about compressing substrings per se, but making a prefix code for the input strings (ie those that can't have all 4 different symbols at the same time) that doesn't expand any string by more than 2 bits.

    What's trivial to prove is that in general, the decomposition of S (where S can have all 4 different symbols present) can't be represented without expansion, unless we assume there's only one possible expansion for each S (in such case, we don't need to store the description of decomposition as it can be deduced from S). The reason for that is that decomposition can be done in many ways and describing it takes space in addition to the symbols themselves. That's very similar to the case with LZ codecs.

  16. #15
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    The code just automates what I was doing manually in the first post in this thread and accepts different starting conditions. I didn't do any serious math.
    I was also leaning toward a combinatorial/counting approach. The code has the most trouble at the beginning (as they always do), which makes exhaustive search possible. I tried thinking of a simple argument, but the constraints don't seem to yield one. Exhaustive search is rather hard to poke holes in, so that ends it.

  17. #16
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    506
    Thanks
    187
    Thanked 177 Times in 120 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    JamesB:
    If I understand correctly, the problem isn't about compressing substrings per se, but making a prefix code for the input strings (ie those that can't have all 4 different symbols at the same time) that doesn't expand any string by more than 2 bits.
    Oops. Yes I misread that bit.

  18. #17
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    since you can have any number of one character string togther you need 2 bits to pick the one of 4 posible characters in each when you go to the next charact it can be 1/5 choices the EOS or still any of the 4 characters in the set put eos has to be 2 bits so in the limit you get 2 extra bits per string. Noe suppose you want to use 2.001 bits for the eof. That would be fine if you only had a small number of strings but finailly .001 * N nubmer of string will be greater thean say 10 then the excess viloates the require of 2 bits per string even if the last string had no bits. There for in the second postion you have to have 2 bits or 1/4 probility for the EOS reserved.

    This also means one would code any substring of one charact type as 2 bits for the character and 2 bits for the EOS except as noted the last substring when done bijectively does not need the EOS token written.


    now from above if the substring had two character the first chacter would by 1/4 for 2 bits
    the second character postions would have to reserve 1/4 or 2 bits for the seen character and the possible EOS this leaves not much for the possible new character its 1/2 * 1/3 meaning you have to EOS for the 2 bit per string requirement but the EOS character for string of two new character has to be less than 2 bits long it has to have a higher probabilty then 1/4 to slove for it
    1/4 first caharact times 3/16 scond charact time X the EOS caharacter for two charact strings has to be 6 bits or 1/64 for a total of 6 bits in two character case where each of two characters different meaning EOS here would be of probability (1/64) * (4) * (16/3) = 1/3 or 1.58496250072 bits
    note 3/16 + 1/3 equals 4 bits

    going to string of of three the third character of interest can not be the first two seen character or the reserved EOS value the first two seen means each know have probability of 1/4 and the EOS reservered space is 1/3 leaving 1/12 probability for new character 1/4 + 1/4 + 1/3 + 1/12 + 1/12 = 1 and 1/12 means 3.58496250072 bits for new third character. so string length so far for this case is
    from probailitys 1/4 * 3/16 * 1/12 or 2 + 2.41503749928 +3.58496250072 = 8 bits which is max allowed for run of 3 caharacter strings all different so no space left for EOS so it can not be done.

  19. #18
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    695
    Thanks
    153
    Thanked 183 Times in 108 Posts
    You can only assign bit codes that are no more than 2 bits longer than the original string if you change the constaint that strings must contain no more than 3 of the 4 symbols to strings must contain exactly 3 of the 4 symbols. Of course at that point, you might as well use the string plus the missing letter as the code. I suppose you could start adjusting symbol weights, but if the data is random then adjustments will not be beneficial on average.

  20. #19
    Member
    Join Date
    Sep 2014
    Location
    Poland
    Posts
    2
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Here's my solution.
    Start with 1 bit telling if the next string is a run of one alph. symbol only or not.
    If not, follow with 1 extra bit, telling if the local alphabet size is 2 or 3. Then write S1.
    Once you have a 2nd symbol (in case of a "singleton run") or a 3rd or 4th different symbol (in the latter case), write this new symbol (i.e. the first symbol of S_2) and just AFTER it again use 1 bit telling if the CURRENT string (=S_2) is a run of one alph. symbol or not. Etc., you know.

    Up to 2 extra bits per each S_i needed.

  21. #20
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    695
    Thanks
    153
    Thanked 183 Times in 108 Posts
    Quote Originally Posted by Lucek View Post
    Here's my solution.
    Start with 1 bit telling if the next string is a run of one alph. symbol only or not.
    If not, follow with 1 extra bit, telling if the local alphabet size is 2 or 3. Then write S1.
    Once you have a 2nd symbol (in case of a "singleton run") or a 3rd or 4th different symbol (in the latter case), write this new symbol (i.e. the first symbol of S_2) and just AFTER it again use 1 bit telling if the CURRENT string (=S_2) is a run of one alph. symbol or not. Etc., you know.

    Up to 2 extra bits per each S_i needed.
    What about the case of string S = 23? You use 1 bit to indicate it is not a run of one symbol. You use a second bit to indicate this has a local alphabet size of 2. You use 4 bits to write 2 and 3. That's 6 bits. You are out of bits but have not yet indicated the EOS.

  22. #21
    Member
    Join Date
    Sep 2014
    Location
    Poland
    Posts
    2
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Kennon Conrad View Post
    What about the case of string S = 23? You use 1 bit to indicate it is not a run of one symbol. You use a second bit to indicate this has a local alphabet size of 2. You use 4 bits to write 2 and 3. That's 6 bits. You are out of bits but have not yet indicated the EOS.
    End-of-String symbol necessary? In the example from the first post (https://stackoverflow.com/questions/...trieve-strings) there was nothing on it.

  23. #22
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    since it fails for 2 exta bits per string and its trival for 4 just add the unsed charter front and back now 4 character each. The question may become can it be doen in 3 using simple bijective arithemtic compressor

    for single string let 2 bits be the one of four in pool of 0 1 2 3 4 alowing 1/8 for any
    EOS in postion where only one type of charact makes such a string have an extra 3 bit and
    the total string length for a repeated string would be 2 * N + 3 just what we want

    know for 2 charter string if same charter different of 1/4 and 1/8 reserved for first character and possible EOS if needed 5/8 * 1/3 for the each of the new possible characters.

    know for character string made up two character types you need 5/24 * EOS probabilty to equal 5 bits
    or 1/32 so Next possible EOS if exactly seen only two character type 1/32 * 24/5 or 3/20 which is 2.73696559417 bits for so every string of possible 2 character type now has an extra 3 bits attached


    Now look at strings where a third type of character appears
    1/4 for first know and 1/4 for second known and 3/20 for the EOF to end a 2 character type if needed
    this adds to 13/20 of space used giving 7/40 of space for the new symbol

    so string of 3 types would be 2 bits for first symbol -log((5/(1/3))/log(2) = 2.26303440583 for second and 2.51457317283 this for a unique 3 character string is lest than 7 bits. Now the new EOS could be two bits when ever 3 types known excess would be less then 2 + .264 + .515 = 2.779 the excess only occurs when the first time a new symbol is seen.


    so for works for 3 bits
    Last edited by biject.bwts; 8th September 2014 at 21:11. Reason: need nother beer my clean up tomorrow

  24. #23
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,497
    Thanks
    26
    Thanked 132 Times in 102 Posts
    Quote Originally Posted by Lucek View Post
    End-of-String symbol necessary? In the example from the first post (https://stackoverflow.com/questions/...trieve-strings) there was nothing on it.
    You have to be able to retrieve the substrings from concatenated string, as the question on SO says. So there need to be some sort of indicator (explicit, implicit, whatever) to tell where substrings start and end.

    Without that requirement the solution would be trivial (well, I'm insulting the word 'trivial' here) - just concatenate the strings without doing anything else. By doing that you have no expansion whatsoever (in every possible case), so it doesn't even need the extra 2 bits allowed by questioner.

    If there was no restriction on number of different symbols in a substring (ie we were allowing 4 different symbols to occur in single substring) and every decomposition was equally probable, then we would need exactly 1 extra bit per each symbol to tell whether that's ending symbol for a substring. And that would be disallowing empty substrings.

    Since there's restriction on used portion of alphabet there is opportunity to compress strings longer than 3 symbols. But as my brute-forcer shows - codespace is exhausted before compression ratio reaches high enough level.

  25. #24
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Look trying to make sense of what guy first said. I think trying to limit the thing to two bits per symbol is a silly thing to do.

    I have thought alot about this problem. I think trying to make the composite string short my limiting it to two bits per substring is foooish. What you really want to do is compress the whole package as short as posible. It is actually an easy problem. But if you have many substrings it really the best you can do. You state that if two symbol or three used they are all equally likely in that subsring. I also am assuming that there is no correlation between the sub srings and each is indepentent.

    1) First for every sting start as 2 bits for the one of four possible first characters go to 2

    2) Next part use till substring is done or a second character appears. Realize that one character known. first cell counter for EOS1 this cell is never cleared it shows count for all substring at point where ended after one character know.

    2a if sting ends EOS1 ture go back to 1 for next string.

    2b RC1K this is running cell for 1 know character it is never cleared between substrings.
    if this symbol occurs then go back to 2.

    2c You know now its 1 of 3 character write bits in arithmetic and go to 3

    3 EOS2 cell counts ends that occurred when 2 character know in string never cleared
    if string ends go to 1
    3a RC2k cell for if 1 of two know characters appers if it appear wite bits in arithmetic using 1/2 each since equally likely according to your write up then go step 3

    3b know you write 1/2 probability or a single bit to tell which symbol used got to step 4

    4 EOS3 cell for when all 3 characters know if EOS occurs go to setp 1
    if not write 1/3 prob ability which symbol occured and go to setp 4

    Notice if most string have many characters and if most strings have the three out of symbols being used the EOS3 filure apporachs one in probability and each of the 3 chars in use would approach 1/3 probability and the compbined string would be just a few bits great than 1.58 bits per input sumbol of the substring. It would never reach that value but should easily be less then 2 bits per symbol meaning compression. look i could if enough people want write code to do this.


    taking his example though a poor example S = {S1=0, S2=12, S3=023}

    you write 2 bits for the first token then you write 1 bit for EOS1 update it 2 ,1 (using Lp for up date why not) go to 1 to start next string

    you write 2 bits for fist token then log(3)/log(2) for not ending EOS1 cell now 2,2 next write 1 bit for RCIK it fails and know is 1,2 now you write log(3)/log(2) bits for which the new symbol is go to step 3 write 1 bit for EOS2 leaving count at 2,1 go to 1 for start of next cell.

    you write 2 bits for first character and 1 bit for EOS1 and update it to 3,2 next write log(3/2)/log(2) bits for RC1K updating it to 1,3 now you write log(3)/log(2) bits for which the new symbol is go to step 3 write log(3)/log(2) bits then 1 bit for EOS2 updating count to 2,2 checking RC2K write 1 bit update now end now write 1 bit for wich 1 of 2 the next character is STOP last string no need for EOS3 at this point

    s1 2 + 1 S2 2 + log(3)/log(2) +1 + log(3)/Log(2) +1 S3 2 + 1 + log(3/2)/log(2) + log(3)/log(2) + log(3)/log(2) + 1 + 1 + 1

    that should be about 20 bits for the compression of 2 strings totaling 12 bits. So worse than the 18 bits guy was shooting on

  26. #25
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    I changed your method so that at least every composite string all all the substrings can have a unique break to the sub strings of you king. I prefer the bijective airthmetic coder but this will shorten your composte strings by a minum of 2 bits and often much more it is bijective

    you can still do a simple combining that is bijective by using unique sentinals. Its best to describe the reverse then the compression should be easy to see

    first pretend there is one invisable charter are end of file its is the last sentinal used no need to write it.

    I would like to break apart any file of the symobls 0123 into subsets of either 1 2 or 3 characers from the set of 4 no problem.

    if composite string to decompress has one token and ends either by another sentinal in which case another substring follows one uses the following character available as looped string.

    0 is only one substring of 1
    1 imples 2
    2 imples 3
    3 imples 0

    note using both sentinals on last in combined series you drop last sentinal

    raw sting followed by string in combined file
    0 333
    1 000
    2 111
    3 222

    2 111
    22 121
    222 1121
    2222 1221
    22222 11221 You see the patern

    for most string of two character you either see an alternate char or alternate with chosen char for
    sentinal as above
    23 implies 131
    223 imples 1231 note all 3 here could be 0 instead
    2223 imples 11231
    22223 imples 12231
    222223 imples 11221

    the next case the hardest its 2 character but the sentinal is not used since the main sentinal for start sential charact is data example data twos and threes but 3 first character the sential can't be a 2 since used. It can't be a 1 since that implies first charter a 2 that only leaves 0.
    in this case the string between starting an ending sentinal must contain both 2's and 3's

    323 implies 0230 note if started with 2 would have used sentinal 1
    3323 implies 03230
    33323 implies 00230
    333323 imples 033230
    3333323 imples 003230
    33333323 impies 0333230
    333333323 impli3w 0033230 and so on

    the last case if four symbols used the sentinal is the one not used and this will be the
    if 1 2 3 appear then you know sentinal is 0 suppose 1 not first charact in string
    321123 implies 03211230 four bits of expansion

    suppose 1 is first character then it could be repeated.

    123 imples 01230 note have to see the 3 character in sub string
    1123 imples 001230
    11123 imples 011230
    111123 imples 001230 and so on

    example
    11122231 imples 011222310
    1112223 imples 01122230
    111222 imples 0112220
    11122 imples 011220
    1112 imples 01120
    111 imples 0010
    11 imples 010
    1 imples 000

    your example 0 12 023 would uniquely becone

    333 020 10231 droping last sentinal 3330201023

    ignoring last sentinal

    your x0xx12xx023x is not unique bit decomp is
    000 from 1
    111 from 2
    222 from 3
    333 from 0

    0120 from 112
    3123 from 212

    10231 from 023

  27. #26
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Some one actually asked for a better explaination here goes may have to edit it

    Well I got email from guy starting thread. So here is an easy way to turn this problem into a bijectve huffman bit stream and from there was to go using a bijective transform to either byte size of 2 or 8.

    Here is a quick summary of bijective huffman to pure strings.

    There are 3 types of symbols the Z zero symbol and the O one symbol and all the R rest of symbols.
    The Zero symbol is only all ZEROS
    the One symbol starts with ONE and if longer than one bit zero filled

    As you go through the table you write out in bits the human string from table of interest
    you start with Flag clear.
    if you encounter a Z you set Flag
    if you encounter an R symbol you clear Flag

    Now to make It bijective to string of ones and zeros
    1 if Flag set after you write last symbol you write a O symbol in the stream
    2 if stream if output stream only ones stop youe done! next step if not all ones
    3 drop trailing zeros from string
    4 the last bit is a one or you did it wrong drop unless its only a 1 it your done

    example
    a 1 the O symbol
    b 01 an R symbol could be many in big table
    c 00 The Z symbol

    a 1 from 2 1
    b 01 from 4 0
    c 00 flag set from 1 001 from 4 00
    aa 11 from from 2 11
    ab 101 from 4 10
    ba 011 form 4 01


    example 2
    a 0 the Z symbol 0
    b 11 an R symbol 1
    c 10 the O symbol

    a 0 flag set from 1 010 from 3 01 from 4 0
    b 11 from 2 11
    c 10 from 3 1 from 4 still a 1
    ab 011 form 4 01
    ac 010 flag set from 1 01010 form 3 0101 from 4 010
    aa 00 flag set from 1 0010 from 2 001 from 4 00
    cc 1010 from 2 101 from 4 10

    I hope not to many types if there are I will fix them if someone finds the errors

    Not that simple huffman bijective out of way here is a simple huffman bijection for your problem

    replace you strings with 5 symbols
    take your example of the subsets in a big set
    R = 1 0 1 3 12 3 1 023 1
    first of drop the ending and starting END OF SUBSET SYMBOLS an replace all double such symobls with E
    your example becomes
    0 E 12 E 023
    know this a unique description of your problem along with fact only 2 type of symbols from set of 4 in any of the subsets

    now write first 1 of 4 symbols as 2 bits only done on first of no more symbols in bigstring you done

    next creat an symbol table for 8 values e0 e1 e2 e3 E0 E1 E2 E3 each following
    character has either the 0 to 3 buy itself using e0 to e3 or combined with following marker E0 to E3 use this until on
    a substring two character known say 0 1 for example then use no weights for symbols. This table used only until two symbols
    known and at start of every substing suspose 0 known from either first 2 bits in file or from exit of previous subsring 0E


    look you change order so short codes for known symbol used 4 such table
    e0 10 all know symblos have 2 bits per symbol only 1 known here e0 eo could be 0 1 2 which ever is known
    e1 01
    e2 110 in this case new symbols on substring cost 2 to 3 bits
    e3 001
    E0 1110
    E1 0001 any internal end of substing cost 4 bits 2 for marker and 2 for only giving first known of next subtring
    E2 1111
    E3 0000


    look you change order for which 2 symbols known 6 such tables table is simalar to above but rows such to known character first
    e0 10
    e1 01 2 known symbols have same weight so nice still 2 bits per known symbol. Note e0 as above e1 is second symbol seen
    e2 110
    e3 001 new symbol found costs 3 bits so 1 over 2
    E0 1110
    E1 0011 as 4 bits for 2 for end and 2 for start of next string as abve
    E2 1111
    E3 0000

    once three symbols known say 0 1 2 for example you can drop the 3 symbols you know know its not there
    there are 4 tables
    e0 10
    e1 01 3 symbols known using same weight so still 2 bits per known symbol and no more to find
    e2 11
    E0 0010 here when string ends and new starts it uses either 3 or 4 instead od solids 3 from above
    E1 0001
    E2 0011
    E3 0000

    once a symbol with E used you know one symbol of next subsymbol as soon as next type know you go to table for two knowm



    0 E 12 E 023

    0 is 00 2 bits now go to table of 1 known
    E1 0001 4 bits note 4 bits needed to exit this table and get start value of next
    e2 xxx 3 bits or 2 for getting new symbol when in table of one known go to table of 2 known
    E0 xxxx 4 bits needed to exit this table and get start of next
    e2 xxx 3 bits or 2 needed for getting new symbol when in table when one known go to table of 2 known
    e3 xxx 3 bits needed for third new symbol in this table


    thats is 19 bits max for your example possible one less when you consider bijection and fact counted uncalculated string with the longer possible cases

    For this part A number of substring that have only 1 character type B number of substrigs have two character types
    C number of strings with 3 charact types N number of strings T total characters of substrings in composite counting E characters

    the max length using huffman is B + 2*C + 2*T for example above A = 1 B = 1 C =1 T= 8 note A not used 1 + 2 + 2*8 = 19 bits
    for single character types internal string have a max lenght of 2*X 2*N strings have only one type of character
    when you increare it to strings of 2 character types a max length of 3*X + 2*N strings 2 types of characters
    when you increase to to srtings of 3 character types Max length of 4*X + 2*N N number of strings 3 types of characters

    Suppose you have 10 string of 1 type of characater 100 of 2 and 1000 of 3 then the max length would be 2(10

    if course for this is just a quick and complete bijection. But if goal to to compress many such substings you need arithmetic bijective
    to compress and if many long strings you coulc compress better the the sum of bits in you subsrting,


    below for more extreme compression givine your string are long
    for example table where 1 it is known if the symbol has appreared many times give it a wieght of 1

    e0 10 1 you could use this table if so far a run of same charter appears 3 or more time
    e1 01 010
    e2 110 0010
    e3 001 0110
    E0 1110 0001
    E1 0001 0011
    E2 1111 0111
    E3 0000 0000

    the tables quickly get more complicated if you want to save alot of space of 2 characters known you need composite characters
    you quickly see you need many more symbols and that is would be come very messy quick for little sayings
    the easiest way to do this bijective arithmetic compression every time you start a new sub stream you clear the counts for 0 1 2 3 except
    for the symbol that got you there then you update the trees for the symbols in use. You also update the E noE e symbol from
    a single cell then all long sub strings would compress.
    Last edited by biject.bwts; 12th September 2014 at 21:39. Reason: table error always 4E codes of 4 bits

  28. #27
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    if you want to roll your one bijective arithmetic compress its easy form the string 012 E 231 E 24 ... etc
    after passing first 2 bits you only have tables of 8 symbols each let me describe the first one notice you need e0 e1 e2 e3 E0 E1 E2 E3 eights symbols

    the top entry is control symbol of leaving this tree and starting new substring. 3 to 1 split between leaving this tree to start new substring or going on
    either way the top bit is updated. when you start the tree you have two sub trees one below the path out E and one below the e the path below E never changes
    there are 4 symbols and give a weight of 1 to each node that way they are always 2 bits in length. remember that each string is only of 1 to 3 of 4 symbols and each
    consecutive string independent of previous. Top node is the only one where the counts change and are never reset the others are reset at first entry to this tree symbols in it each weigh 2 for the not used symbols and for the known it weight 3 this give on first pass 2 bits for known symbol but has weight of Ee changes in many runs if could be more or less than 2 bits. In case if no new symbol and end have not occurred the weight of know symbol increases. I would start with 1 and if you have several practice files you could test other amounts. Just remember this subtree resets to the 2 2 2 3 weights at start of every new run when one symbol known.

    the other 2 tables are similar if you understand what I do for first table Each has its top Ee and start with 3 to 1 to give first time use of 4 bits include which new symbol at start of subset and the counters continue to all processing done. As above the table for 2 known symbol each on first pass have to be such that 1/6 1/6 1/3 1/3 wieghts of 1 1 2 2 set at start this gives on first entry length of 2 bits for known symbol and you increase weight by one until new character or exit. One more point if you know 2 he claims that they are equally likely in which case every time through the loop update the 2 known symbols equally. Note when I say 1 1 2 2 I mean 1*n 1*n 2*n 2*n are equally likely weights and when I say oncrement by 1 I mean by n where n and m positive numbers. Note I would only play with these constants if trained by a large number of big composite sets. The fact is it unlikely that small change in what I already have would make much difference.

    the last table for 3 known symbols like above a top tell Ee preloaded once and never reset so 2 bits of exit at start if string ends and 2 bits for with of 4 symbols is next starting symbol of string. This time however only 3 symbols below e part and each could have a starting weight of 1 1 1 the way it would guarantee 2 bits first time you never update this table since the auther claimed that if 3 symbols present then each has exactly the same likely hood.
    I am not sure I believe he wants this since if he does a sub string 012222222222222222 would compress to the same size as 012021102120210201 if you want the
    first to compress more than second then you need to update the last sub table mentioned.

    The point is you have three tops nodes the new reset and are updated till whole composite set done
    each time you jump to a table you reset all the values except the top nodes.

  29. #28
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    given the rules for a composite set of files from subsets its easy to construct a method that always gives 3 extra bits for every one character type string in large composite file. And 3 extra bits for every 2 character type of string in a large composite file. And for 3 character types you would be left with always evcess of about 2.77 bits /per subset of overhead some time 2 sometimes 3. This means it is impossible to do the 2 bits extra for every possible composite file looking 2*n where n number of subsets for every composite file. Looking at it also means its easy to limit it to exta 3 bits / subset. For every possible composite file.

    I can show the proof if people want.

    But if all types of file over a certain length easy to actually compress to smaller than character in the subset. counting character as 2 bits and nothing for EOSits

    The thing about getting the the 2 bit limit / subset if the average character length of subset is such and such is non
    sense. Its a whack a mole trade off. Using the average length. However again if you start the min size for all file type then even straight is possible.

  30. #29
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts

    Talking

    HERE is simple constructions that is bijective

    all type 1 character are 2*N + 3 bits
    all type 2 character are 2*N + ( 2 or 3 ) bits
    all type 3 character are 2"N + 3 bits

    and yes if most string long this is not the way to do it since if most of each type long you can get lots of compression.

    here is a straight simply way that shows how to code it. A way that does not use the full sapce but is easy to do
    write 2 bits or what ever the first charter sen in the composite file
    next if the charter repeat use 2 bits 1/4 of the space say 01 to stand for contunuation,
    do that tell is ends. Write a 000 or 1/8 for spave to terminate and start over

    you have 5/8 of spave left to define when a second charact appear to signal more than 2 character

    you 3 use values of 1/16 to signal the next charact and fact its a 2 charact stream and will end
    when you know its going to end you use the next bit as 0 to mean it stopped 0 thats 3 bits over head. 10 means first symbol again 11 means second seen symbol again and containuals 2 bits per character till end using 1/16 gives 3 extra bits for every 2 only subset you actual can use
    one of the symbols chose and random to use 1/8 for full code sapce in which case 1 out 3 time you would have a 2 character type with 2 extra bits instead of 3.

    them you use the values of 1/8 to indicate the 3 character and fact you will eventuall have a third symbol

    so you used at this point of transiton 2/16 to terminate a type 1 charcter sting a 4/16 to say the type one went on and 1/16 3 for new charact and mark it 2 charact type and 3 2/16 to mark new character and fact 3 will occur. this 2 + 4 + 3 + 6 or 15

    know you continue this 3 in character case not have seen the third character yet but dividing the spave up like this use 11 for first charcter seen and 10 for second then 00 if new character first of 2 possible or 01 if second. Know you have the first 3 character with only one extra bit. Then you use 01 for first known 10 for second known 11 for third and 00 for control end for a total of 3 3bits of over head

    so all files of type 1 have 3 bits of overhead all or type 2 have 2 or 3 bits while those of type 3 always have 3


    starting type woud be 0 , 1, 2, 3 would be in order 00 01 10 11
    examples of type 1 1 11 111 222 333 using 01 for contine
    would be in order (00 000) (01 01 000) (01 01 01 000) (01 01 01 000) (11 01 01 000)

    example of two character type 2 10 11001 133 1212 23
    (01 1000 0) (01 01 1001 11 11 10 0) (10 101 note 1 of 3 will have 3 bits instead of 3 for secong character 11 0) (01 1010 10 11 0) (10 00xx 0) depends on how I set the table up

    example of 3 33311331222333123
    11 01 01 01 "1 of 3/8 values availabe" 110 11 10 10 " 2 choice left 00 fist 01 second" 10 "now 3 characters 10 10 11 11 11 11 01 10 11 00) 3 contorl bits

  31. #30
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts

    Thumbs up

    now that I see its done in 3 here is the easiest way to do it and minimzes the extra bits its sort of like the last post
    hope ita easy to read

    this where composite string in 5 symbol format converting bijectively to bits
    you have the data symnols 0 to 3 and E

    start with this table for the enter for first charcter
    0 10
    1 01
    2 11 first character only
    3 00

    leave the same if 0 1 or 2 occurred 3 occurred assume 3 occured move swap it with 0
    3 10
    1 01
    2 11
    0 001
    E 000 E in this case if needed is end of string


    if not a 3 next leave table the same continue to new symbol or write E 000 3 control bits for substring that had only one character type when built

    either you are now ended or a new character has appeared if new characte a 2 bit symbol say 1 leave the table alone
    3 10
    1 01 table A
    2 11
    0 001
    E 000 now you conintue to you end writing 3 bits 000 or a new character which could be done with 2 of 3 bits

    know the second charcter could have been say 0 instead of the 3 left then the table of 2 character know
    where send create by 3 in this case becomes

    E 00
    0 10
    3 01 now E to zeroes
    1 111
    2 110 so again we that all 2 charter sub string have three bits extra since know use 2 bits when second character cost ant extra bit.

    this leave us when the third character appears with 0 or 1 control character used
    and the last table

    first know character 10
    second know character 01
    thrid know character 11
    the E character is 00

    so the substring of type 1 always have 3 bits
    the type 2 always have 3 bits
    and the type 3 can be have 2 or 3 or 4 bits

    if everything random for the type 3 substring

    to get the 2 bits you have 2/3 when going to second 1/2 or 1/3 chance of 2
    to get the 3 bits you have 2/3 * 1/2 = 1/3 chance of 3
    to get the 4 bits 1/3 * 1 or a 1/3 chance of 4 bits

    the 4 bits is a payne in ass if you always want 3 or less

    so going back to table A

    3 10
    1A 01 table x
    2B 11
    0C 001
    E 000

    there is 5/8 that needs to be redefined to get out of this 2 to 4 bits at end

    C0 string will end and not got to 3 characater types 10 means 3 11 means 0 stop is 0 so 3 bits
    C1 string will end and not got to 3 characater types 10 means 3 11 means 2 stop is 0 so 3 bits
    A0 string will end and not got to 3 characater types 10 means 3 11 means 1 stop is 0 so 2 bits
    B0 string will got to 3 charcter 10 = 3, 11 = 0 continue to 01 = 0 00,01 = 2,1 then go to next table
    B1 string will got to 3 charcter 10 = 3, 11 = 2 continue to 01 = 0 00,01 = 0,1 then go to next table
    B1 string will got to 3 charcter 10 = 3, 11 = 1 continue to 01 = 0 00,01 = 0,2 then go to next table

    now that third character found at cost of 1 bit use 10 01 11 00 for the 3 known characters 00 is end of string so cost 3 charactes.

    summary all string of 1 type of character 3 extra bits 33 would be 7 bits long
    summary all string of 2 types of character 2 or 3 bits extra 32 would be 7 bits while 31 6 bis
    summary all string of 3 types of character 3 bits extra

    012222222 would be 21 bits
    012102120 would also be 21 bits

Similar Threads

  1. Most efficient/practical compression method for short strings?
    By never frog in forum Data Compression
    Replies: 6
    Last Post: 1st September 2009, 04:05

Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •