1. ## Concatenating strings

Hello everyone!

I'd like to share a post that I've recently submitted to SO. I would really appreciate any idea or hint to solve this problem.

https://stackoverflow.com/questions/...trieve-strings

Another SO user, Oli, has posted a very useful comment that however seems to work only for longer strings (10s of symbols or more). I was wondering whether there is a solution for shorter strings, for example if the average length is 5 symbols.

Thank you in advance for any comment.  Reply With Quote

2. I don't know if your problem is solvable but let's try.

First problem: are zero lengths strings allowed?

Regardless, you have to design a prefix code. You've said your codes can't expand the input by more than two bits. So each string of length N symbols can be encoded to output of maximum 2 * (N + 1) bits. Since there's no gain from assigning codes shorter than maximum allowed in regard to correctness of solution, we should just assign maximum code length to every input string and then look if we can build prefix code.

So let's assign codes, assuming zero-length strings are possible:
Code:
```- -> 00 (zero length input string)
0 -> 0100
... (there are 4 1-letter strings: from 0 to 3)
3 -> 0111
00 -> 100000
... (there are 16 2-letter strings: from 00 to 33)
33 -> 101111
000 -> 11000000
... (there are 64 3-letter strings: from 000 to 333)
333 -> 11111111```
We're out of codespace now. We can't encode strings longer than 3 symbols if we allow zero-length strings and fill codespace sequentially.

If we disallow zero-length string then situation will be different. A program would be useful there, but I'm not so inclined :]  Reply With Quote

3. Originally Posted by Piotr Tarsa First problem: are zero lengths strings allowed?
Good point. No, in my particular case I don't consider null strings. Only strings of length 1 or more are relevant.  Reply With Quote

4. Either way, because 1 to 3 symbol length input strings can't be compressed you need to use those 2 extra bits to tell the length of input string.
00 - for 1 symbol long input strings,
01 - for 2 ..
10 - for 3 ..
11 - for 4 symbol long or longer input strings.

So the question can now be rephrased (remember: assuming empty input strings are disallowed): can we build prefix code that don't expand data (as we already used those 2 extra bits) and unambiguously encodes every 4 symbol long or longer input string and how?

My guess is: we can't.  Reply With Quote

5. you basically want to encode 5 letter alphabet message using at most 2 bits per symbol. even with additional restrictions that you have provided, i think it is impossible (i can't prove it though, just my intuition).

however, if you impose ordering on your strings, you can get very close to 2 bits per symbol.

1) encode all strings containing only up to 2 different symbols, in lexicographic order. use 1st unused symbol as end of string marker, 2nd unused symbol as end of stream of "strings that can be delimited by 1st unused symbol". also, there's no need to output the last "end of string" before "end of stream".

2) encode the rest of strings, again, in lexicographic order, using the unused 4th symbol as end of string marker. however, unlike the first part, you can't use another symbol as end of stream of "strings delimited by 4th symbol". an obvious (though maybe not the best) solution is to use 4th symbol two times in a row as end of stream marker (essentially encoding empty string to mark end of stream). sadly, each such end of stream marker is 2 bits over your limit. but since there's maximum of 4 of such streams, that's only 8 extra bits regardless of number of strings.  Reply With Quote

6. Any string S of random values (0, 1, 2, 3) can itself be broken down into sub-strings containing no more than 3 out of the 4 symbols in it, ie S = (S1, S2, S3, .. Sn). Yet you are asking for "at most" 2 bits per character in the sub-string and hence by definition at most 2 bits per character in S, which I'll remind you is totally random from 4 symbols.

Basically it's impossible unless the "at most" is "equal to" 2 bits. Then it's trivial unless we want the string length to be an encoded value in there somewhere too (in which case it's impossible too).  Reply With Quote

7. OK. I've actually written the code that proves it's impossible to construct prefix code satisfying your restrictions.

Here it is: http://ideone.com/2sfDag
It writes "error" at the moment of codespace exhaustion.  Reply With Quote

8. One thing I notice is that this could be expressed differently, and a better and more succinct description might promote clarity of thought and also make the constraints more obvious. As I understand, you don't so much have a set as a sequence, which means elements may repeat and order is significant. You didn't say retrieval must be fast, so if I understand you, you could imagine the strings already concatenated and separated with end symbols to make a single large string. The problem is then simply to compress this string. Is that accurate?

Piotr's point about prefix codes seems on the right track. But there are infinitely many strings, so a more pertinent comparison might be to universal codes like Elias delta or Fibonacci. My gut feeling is that the constraints are not enough to guarantee the number of bits you are looking for.  Reply With Quote

9. My quick experiments suggest that with extra 3 bits (instead of extra 2 bits) encoding is possible.

(In my code) simply increase starting value of codeLen from 4 to 5, so that the difference between inputLen and codeLen is equal to 3 (instead of 2) and the main loop magically runs for lots of iterations.

Further experiments show that (alas) if you allow empty input strings then even 3 extra bits aren't enough. And if we go to 4 bits then we can use the simple solution you've described on SO (ie for every substring, prepend and append a symbol not appearing in the substring to that substring - that is trivially correct).  Reply With Quote

10. if it can be done then it can be done with a special case of bijective arithmetic compressor/
looking at worse cases only for a single string
-log(1/4) 2 bits one token no eos token since only if follow up string is eos needed

2 tokens
-log(1/4 * (1/2)(1/3) )/log(2) = 4.5849625... since allowing already seen to be 2 bits same for EOS if followup string needed
out of 6 bits allowed

3 tokens
-log(1/4* (1/2)*(1/3)* (1/4)*(1/2))/log(2) = 7.5849625... in this case 2 symbols known and EOS
out of 8 bits allowed no each token add 2 bits and excess stays the same.

after this worst case for a single string have same excesses since each symol used and eos have weight of two

worse case only for two strings look at 3 tokens for each sting each sting of 3

the fist string compress to 7.5849625... eos to 2 bits the secong string to 7.5849625...
for a length of 17.169925 where 16 allowed

You can not compress so that the max bits gained is no more than two per sting. If each subset made of there symbols not known in advance
but given stings of 1 to 3 symbols only out of a set of 4 you can write a program to solve you problem.

However for long strings and a different bijective compressor you can make the stings much smaller and then if each string of a certain length you could easily solve the problem.

ERROR IN MY POST I feel bad and good at the same time. I went to sleep happy I did this. Then the nightmare in sleep realized that I made a mistake for worst case for several strings of 2 token but together. It was assuming that the EOS character when written had to be 2 bits long for string longer than one token wrong. I can write string worst case substrings of 2 tokens together withe EOS log 3 / log 2 bits = 1.58496250072 which means all substrings made of one or 2 tokens will expand the 2 bits max so I will rewrite this give me a day I woke to early to finish the 3 token strings. So unless you beat me to it. So it might be possible to do it bijectively and the 2 bit expansion limit might work but not sure.  Reply With Quote

11. Originally Posted by Piotr Tarsa OK. I've actually written the code that proves it's impossible to construct prefix code satisfying your restrictions.

Here it is: http://ideone.com/2sfDag
It writes "error" at the moment of codespace exhaustion.
I didn't notice that you were claiming a proof. A proof ends the discussion, naturally. You might take a moment to justify your method, though, for the benefit of the silent readership as well as future generations. Code doesn't really speak for itself.  Reply With Quote

12. ## Thanks:

biject.bwts (8th September 2014)

13. The code just automates what I was doing manually in the first post in this thread and accepts different starting conditions. I didn't do any serious math.  Reply With Quote

14. While not a mathematical proof, it is obviously impossible.

Let S be a string made up of random symbols from an alphabet of size 4. This clearly takes log2(4) bits (==2) to encode per symbol, plus a string length. (Or if we're using end-of-string as an alphabet character then it takes >2 bits per symbol).

Set S' being the set of sub-strings of S such that within S' no more than 3 out of 4 alphabet symbols are present. The S' is ordered such that concatenatation of all elements of S' form the original S again.

If it was possible to compress S' in anything smaller than S then we'd have invented recursive compression (we'd take the compression of S', convert it into base 2 and regenerate a new string S and repeat). We know this to be a lie, so the whole premise of S' being < 2 bits is also in error.  Reply With Quote

15. JamesB:
If I understand correctly, the problem isn't about compressing substrings per se, but making a prefix code for the input strings (ie those that can't have all 4 different symbols at the same time) that doesn't expand any string by more than 2 bits.

What's trivial to prove is that in general, the decomposition of S (where S can have all 4 different symbols present) can't be represented without expansion, unless we assume there's only one possible expansion for each S (in such case, we don't need to store the description of decomposition as it can be deduced from S). The reason for that is that decomposition can be done in many ways and describing it takes space in addition to the symbols themselves. That's very similar to the case with LZ codecs.  Reply With Quote

16. Originally Posted by Piotr Tarsa The code just automates what I was doing manually in the first post in this thread and accepts different starting conditions. I didn't do any serious math.
I was also leaning toward a combinatorial/counting approach. The code has the most trouble at the beginning (as they always do), which makes exhaustive search possible. I tried thinking of a simple argument, but the constraints don't seem to yield one. Exhaustive search is rather hard to poke holes in, so that ends it.  Reply With Quote

17. Originally Posted by Piotr Tarsa JamesB:
If I understand correctly, the problem isn't about compressing substrings per se, but making a prefix code for the input strings (ie those that can't have all 4 different symbols at the same time) that doesn't expand any string by more than 2 bits.
Oops. Yes I misread that bit.  Reply With Quote

18. since you can have any number of one character string togther you need 2 bits to pick the one of 4 posible characters in each when you go to the next charact it can be 1/5 choices the EOS or still any of the 4 characters in the set put eos has to be 2 bits so in the limit you get 2 extra bits per string. Noe suppose you want to use 2.001 bits for the eof. That would be fine if you only had a small number of strings but finailly .001 * N nubmer of string will be greater thean say 10 then the excess viloates the require of 2 bits per string even if the last string had no bits. There for in the second postion you have to have 2 bits or 1/4 probility for the EOS reserved.

This also means one would code any substring of one charact type as 2 bits for the character and 2 bits for the EOS except as noted the last substring when done bijectively does not need the EOS token written.

now from above if the substring had two character the first chacter would by 1/4 for 2 bits
the second character postions would have to reserve 1/4 or 2 bits for the seen character and the possible EOS this leaves not much for the possible new character its 1/2 * 1/3 meaning you have to EOS for the 2 bit per string requirement but the EOS character for string of two new character has to be less than 2 bits long it has to have a higher probabilty then 1/4 to slove for it
1/4 first caharact times 3/16 scond charact time X the EOS caharacter for two charact strings has to be 6 bits or 1/64 for a total of 6 bits in two character case where each of two characters different meaning EOS here would be of probability (1/64) * (4) * (16/3) = 1/3 or 1.58496250072 bits
note 3/16 + 1/3 equals 4 bits

going to string of of three the third character of interest can not be the first two seen character or the reserved EOS value the first two seen means each know have probability of 1/4 and the EOS reservered space is 1/3 leaving 1/12 probability for new character 1/4 + 1/4 + 1/3 + 1/12 + 1/12 = 1 and 1/12 means 3.58496250072 bits for new third character. so string length so far for this case is
from probailitys 1/4 * 3/16 * 1/12 or 2 + 2.41503749928 +3.58496250072 = 8 bits which is max allowed for run of 3 caharacter strings all different so no space left for EOS so it can not be done.  Reply With Quote

19. You can only assign bit codes that are no more than 2 bits longer than the original string if you change the constaint that strings must contain no more than 3 of the 4 symbols to strings must contain exactly 3 of the 4 symbols. Of course at that point, you might as well use the string plus the missing letter as the code. I suppose you could start adjusting symbol weights, but if the data is random then adjustments will not be beneficial on average.  Reply With Quote

20. Here's my solution.
Start with 1 bit telling if the next string is a run of one alph. symbol only or not.
If not, follow with 1 extra bit, telling if the local alphabet size is 2 or 3. Then write S1.
Once you have a 2nd symbol (in case of a "singleton run") or a 3rd or 4th different symbol (in the latter case), write this new symbol (i.e. the first symbol of S_2) and just AFTER it again use 1 bit telling if the CURRENT string (=S_2) is a run of one alph. symbol or not. Etc., you know.

Up to 2 extra bits per each S_i needed.  Reply With Quote

21. Originally Posted by Lucek Here's my solution.
Start with 1 bit telling if the next string is a run of one alph. symbol only or not.
If not, follow with 1 extra bit, telling if the local alphabet size is 2 or 3. Then write S1.
Once you have a 2nd symbol (in case of a "singleton run") or a 3rd or 4th different symbol (in the latter case), write this new symbol (i.e. the first symbol of S_2) and just AFTER it again use 1 bit telling if the CURRENT string (=S_2) is a run of one alph. symbol or not. Etc., you know.

Up to 2 extra bits per each S_i needed.
What about the case of string S = 23? You use 1 bit to indicate it is not a run of one symbol. You use a second bit to indicate this has a local alphabet size of 2. You use 4 bits to write 2 and 3. That's 6 bits. You are out of bits but have not yet indicated the EOS.  Reply With Quote

22. Originally Posted by Kennon Conrad What about the case of string S = 23? You use 1 bit to indicate it is not a run of one symbol. You use a second bit to indicate this has a local alphabet size of 2. You use 4 bits to write 2 and 3. That's 6 bits. You are out of bits but have not yet indicated the EOS.
End-of-String symbol necessary? In the example from the first post (https://stackoverflow.com/questions/...trieve-strings) there was nothing on it.  Reply With Quote

23. since it fails for 2 exta bits per string and its trival for 4 just add the unsed charter front and back now 4 character each. The question may become can it be doen in 3 using simple bijective arithemtic compressor

for single string let 2 bits be the one of four in pool of 0 1 2 3 4 alowing 1/8 for any
EOS in postion where only one type of charact makes such a string have an extra 3 bit and
the total string length for a repeated string would be 2 * N + 3 just what we want

know for 2 charter string if same charter different of 1/4 and 1/8 reserved for first character and possible EOS if needed 5/8 * 1/3 for the each of the new possible characters.

know for character string made up two character types you need 5/24 * EOS probabilty to equal 5 bits
or 1/32 so Next possible EOS if exactly seen only two character type 1/32 * 24/5 or 3/20 which is 2.73696559417 bits for so every string of possible 2 character type now has an extra 3 bits attached

Now look at strings where a third type of character appears
1/4 for first know and 1/4 for second known and 3/20 for the EOF to end a 2 character type if needed
this adds to 13/20 of space used giving 7/40 of space for the new symbol

so string of 3 types would be 2 bits for first symbol -log((5/ (1/3))/log(2) = 2.26303440583 for second and 2.51457317283 this for a unique 3 character string is lest than 7 bits. Now the new EOS could be two bits when ever 3 types known excess would be less then 2 + .264 + .515 = 2.779 the excess only occurs when the first time a new symbol is seen.

so for works for 3 bits  Reply With Quote

24. Originally Posted by Lucek End-of-String symbol necessary? In the example from the first post (https://stackoverflow.com/questions/...trieve-strings) there was nothing on it.
You have to be able to retrieve the substrings from concatenated string, as the question on SO says. So there need to be some sort of indicator (explicit, implicit, whatever) to tell where substrings start and end.

Without that requirement the solution would be trivial (well, I'm insulting the word 'trivial' here) - just concatenate the strings without doing anything else. By doing that you have no expansion whatsoever (in every possible case), so it doesn't even need the extra 2 bits allowed by questioner.

If there was no restriction on number of different symbols in a substring (ie we were allowing 4 different symbols to occur in single substring) and every decomposition was equally probable, then we would need exactly 1 extra bit per each symbol to tell whether that's ending symbol for a substring. And that would be disallowing empty substrings.

Since there's restriction on used portion of alphabet there is opportunity to compress strings longer than 3 symbols. But as my brute-forcer shows - codespace is exhausted before compression ratio reaches high enough level.  Reply With Quote

25. Look trying to make sense of what guy first said. I think trying to limit the thing to two bits per symbol is a silly thing to do.

I have thought alot about this problem. I think trying to make the composite string short my limiting it to two bits per substring is foooish. What you really want to do is compress the whole package as short as posible. It is actually an easy problem. But if you have many substrings it really the best you can do. You state that if two symbol or three used they are all equally likely in that subsring. I also am assuming that there is no correlation between the sub srings and each is indepentent.

1) First for every sting start as 2 bits for the one of four possible first characters go to 2

2) Next part use till substring is done or a second character appears. Realize that one character known. first cell counter for EOS1 this cell is never cleared it shows count for all substring at point where ended after one character know.

2a if sting ends EOS1 ture go back to 1 for next string.

2b RC1K this is running cell for 1 know character it is never cleared between substrings.
if this symbol occurs then go back to 2.

2c You know now its 1 of 3 character write bits in arithmetic and go to 3

3 EOS2 cell counts ends that occurred when 2 character know in string never cleared
if string ends go to 1
3a RC2k cell for if 1 of two know characters appers if it appear wite bits in arithmetic using 1/2 each since equally likely according to your write up then go step 3

3b know you write 1/2 probability or a single bit to tell which symbol used got to step 4

4 EOS3 cell for when all 3 characters know if EOS occurs go to setp 1
if not write 1/3 prob ability which symbol occured and go to setp 4

Notice if most string have many characters and if most strings have the three out of symbols being used the EOS3 filure apporachs one in probability and each of the 3 chars in use would approach 1/3 probability and the compbined string would be just a few bits great than 1.58 bits per input sumbol of the substring. It would never reach that value but should easily be less then 2 bits per symbol meaning compression. look i could if enough people want write code to do this.

taking his example though a poor example S = {S1=0, S2=12, S3=023}

you write 2 bits for the first token then you write 1 bit for EOS1 update it 2 ,1 (using Lp for up date why not) go to 1 to start next string

you write 2 bits for fist token then log(3)/log(2) for not ending EOS1 cell now 2,2 next write 1 bit for RCIK it fails and know is 1,2 now you write log(3)/log(2) bits for which the new symbol is go to step 3 write 1 bit for EOS2 leaving count at 2,1 go to 1 for start of next cell.

you write 2 bits for first character and 1 bit for EOS1 and update it to 3,2 next write log(3/2)/log(2) bits for RC1K updating it to 1,3 now you write log(3)/log(2) bits for which the new symbol is go to step 3 write log(3)/log(2) bits then 1 bit for EOS2 updating count to 2,2 checking RC2K write 1 bit update now end now write 1 bit for wich 1 of 2 the next character is STOP last string no need for EOS3 at this point

s1 2 + 1 S2 2 + log(3)/log(2) +1 + log(3)/Log(2) +1 S3 2 + 1 + log(3/2)/log(2) + log(3)/log(2) + log(3)/log(2) + 1 + 1 + 1

that should be about 20 bits for the compression of 2 strings totaling 12 bits. So worse than the 18 bits guy was shooting on  Reply With Quote

26. I changed your method so that at least every composite string all all the substrings can have a unique break to the sub strings of you king. I prefer the bijective airthmetic coder but this will shorten your composte strings by a minum of 2 bits and often much more it is bijective

you can still do a simple combining that is bijective by using unique sentinals. Its best to describe the reverse then the compression should be easy to see

first pretend there is one invisable charter are end of file its is the last sentinal used no need to write it.

I would like to break apart any file of the symobls 0123 into subsets of either 1 2 or 3 characers from the set of 4 no problem.

if composite string to decompress has one token and ends either by another sentinal in which case another substring follows one uses the following character available as looped string.

0 is only one substring of 1
1 imples 2
2 imples 3
3 imples 0

note using both sentinals on last in combined series you drop last sentinal

raw sting followed by string in combined file
0 333
1 000
2 111
3 222

2 111
22 121
222 1121
2222 1221
22222 11221 You see the patern

for most string of two character you either see an alternate char or alternate with chosen char for
sentinal as above
23 implies 131
223 imples 1231 note all 3 here could be 0 instead
2223 imples 11231
22223 imples 12231
222223 imples 11221

the next case the hardest its 2 character but the sentinal is not used since the main sentinal for start sential charact is data example data twos and threes but 3 first character the sential can't be a 2 since used. It can't be a 1 since that implies first charter a 2 that only leaves 0.
in this case the string between starting an ending sentinal must contain both 2's and 3's

323 implies 0230 note if started with 2 would have used sentinal 1
3323 implies 03230
33323 implies 00230
333323 imples 033230
3333323 imples 003230
33333323 impies 0333230
333333323 impli3w 0033230 and so on

the last case if four symbols used the sentinal is the one not used and this will be the
if 1 2 3 appear then you know sentinal is 0 suppose 1 not first charact in string
321123 implies 03211230 four bits of expansion

suppose 1 is first character then it could be repeated.

123 imples 01230 note have to see the 3 character in sub string
1123 imples 001230
11123 imples 011230
111123 imples 001230 and so on

example
11122231 imples 011222310
1112223 imples 01122230
111222 imples 0112220
11122 imples 011220
1112 imples 01120
111 imples 0010
11 imples 010
1 imples 000

your example 0 12 023 would uniquely becone

333 020 10231 droping last sentinal 3330201023

ignoring last sentinal

your x0xx12xx023x is not unique bit decomp is
000 from 1
111 from 2
222 from 3
333 from 0

0120 from 112
3123 from 212

10231 from 023  Reply With Quote

27. Some one actually asked for a better explaination here goes may have to edit it

Well I got email from guy starting thread. So here is an easy way to turn this problem into a bijectve huffman bit stream and from there was to go using a bijective transform to either byte size of 2 or 8.

Here is a quick summary of bijective huffman to pure strings.

There are 3 types of symbols the Z zero symbol and the O one symbol and all the R rest of symbols.
The Zero symbol is only all ZEROS
the One symbol starts with ONE and if longer than one bit zero filled

As you go through the table you write out in bits the human string from table of interest
if you encounter a Z you set Flag
if you encounter an R symbol you clear Flag

Now to make It bijective to string of ones and zeros
1 if Flag set after you write last symbol you write a O symbol in the stream
2 if stream if output stream only ones stop youe done! next step if not all ones
3 drop trailing zeros from string
4 the last bit is a one or you did it wrong drop unless its only a 1 it your done

example
a 1 the O symbol
b 01 an R symbol could be many in big table
c 00 The Z symbol

a 1 from 2 1
b 01 from 4 0
c 00 flag set from 1 001 from 4 00
aa 11 from from 2 11
ab 101 from 4 10
ba 011 form 4 01

example 2
a 0 the Z symbol 0
b 11 an R symbol 1
c 10 the O symbol

a 0 flag set from 1 010 from 3 01 from 4 0
b 11 from 2 11
c 10 from 3 1 from 4 still a 1
ab 011 form 4 01
ac 010 flag set from 1 01010 form 3 0101 from 4 010
aa 00 flag set from 1 0010 from 2 001 from 4 00
cc 1010 from 2 101 from 4 10

I hope not to many types if there are I will fix them if someone finds the errors

Not that simple huffman bijective out of way here is a simple huffman bijection for your problem

replace you strings with 5 symbols
take your example of the subsets in a big set
R = 1 0 1 3 12 3 1 023 1
first of drop the ending and starting END OF SUBSET SYMBOLS an replace all double such symobls with E
0 E 12 E 023
know this a unique description of your problem along with fact only 2 type of symbols from set of 4 in any of the subsets

now write first 1 of 4 symbols as 2 bits only done on first of no more symbols in bigstring you done

next creat an symbol table for 8 values e0 e1 e2 e3 E0 E1 E2 E3 each following
character has either the 0 to 3 buy itself using e0 to e3 or combined with following marker E0 to E3 use this until on
a substring two character known say 0 1 for example then use no weights for symbols. This table used only until two symbols
known and at start of every substing suspose 0 known from either first 2 bits in file or from exit of previous subsring 0E

look you change order so short codes for known symbol used 4 such table
e0 10 all know symblos have 2 bits per symbol only 1 known here e0 eo could be 0 1 2 which ever is known
e1 01
e2 110 in this case new symbols on substring cost 2 to 3 bits
e3 001
E0 1110
E1 0001 any internal end of substing cost 4 bits 2 for marker and 2 for only giving first known of next subtring
E2 1111
E3 0000

look you change order for which 2 symbols known 6 such tables table is simalar to above but rows such to known character first
e0 10
e1 01 2 known symbols have same weight so nice still 2 bits per known symbol. Note e0 as above e1 is second symbol seen
e2 110
e3 001 new symbol found costs 3 bits so 1 over 2
E0 1110
E1 0011 as 4 bits for 2 for end and 2 for start of next string as abve
E2 1111
E3 0000

once three symbols known say 0 1 2 for example you can drop the 3 symbols you know know its not there
there are 4 tables
e0 10
e1 01 3 symbols known using same weight so still 2 bits per known symbol and no more to find
e2 11
E0 0010 here when string ends and new starts it uses either 3 or 4 instead od solids 3 from above
E1 0001
E2 0011
E3 0000

once a symbol with E used you know one symbol of next subsymbol as soon as next type know you go to table for two knowm

0 E 12 E 023

0 is 00 2 bits now go to table of 1 known
E1 0001 4 bits note 4 bits needed to exit this table and get start value of next
e2 xxx 3 bits or 2 for getting new symbol when in table of one known go to table of 2 known
E0 xxxx 4 bits needed to exit this table and get start of next
e2 xxx 3 bits or 2 needed for getting new symbol when in table when one known go to table of 2 known
e3 xxx 3 bits needed for third new symbol in this table

thats is 19 bits max for your example possible one less when you consider bijection and fact counted uncalculated string with the longer possible cases

For this part A number of substring that have only 1 character type B number of substrigs have two character types
C number of strings with 3 charact types N number of strings T total characters of substrings in composite counting E characters

the max length using huffman is B + 2*C + 2*T for example above A = 1 B = 1 C =1 T= 8 note A not used 1 + 2 + 2*8 = 19 bits
for single character types internal string have a max lenght of 2*X 2*N strings have only one type of character
when you increare it to strings of 2 character types a max length of 3*X + 2*N strings 2 types of characters
when you increase to to srtings of 3 character types Max length of 4*X + 2*N N number of strings 3 types of characters

Suppose you have 10 string of 1 type of characater 100 of 2 and 1000 of 3 then the max length would be 2(10

if course for this is just a quick and complete bijection. But if goal to to compress many such substings you need arithmetic bijective
to compress and if many long strings you coulc compress better the the sum of bits in you subsrting,

below for more extreme compression givine your string are long
for example table where 1 it is known if the symbol has appreared many times give it a wieght of 1

e0 10 1 you could use this table if so far a run of same charter appears 3 or more time
e1 01 010
e2 110 0010
e3 001 0110
E0 1110 0001
E1 0001 0011
E2 1111 0111
E3 0000 0000

the tables quickly get more complicated if you want to save alot of space of 2 characters known you need composite characters
you quickly see you need many more symbols and that is would be come very messy quick for little sayings
the easiest way to do this bijective arithmetic compression every time you start a new sub stream you clear the counts for 0 1 2 3 except
for the symbol that got you there then you update the trees for the symbols in use. You also update the E noE e symbol from
a single cell then all long sub strings would compress.  Reply With Quote

28. if you want to roll your one bijective arithmetic compress its easy form the string 012 E 231 E 24 ... etc
after passing first 2 bits you only have tables of 8 symbols each let me describe the first one notice you need e0 e1 e2 e3 E0 E1 E2 E3 eights symbols

the top entry is control symbol of leaving this tree and starting new substring. 3 to 1 split between leaving this tree to start new substring or going on
either way the top bit is updated. when you start the tree you have two sub trees one below the path out E and one below the e the path below E never changes
there are 4 symbols and give a weight of 1 to each node that way they are always 2 bits in length. remember that each string is only of 1 to 3 of 4 symbols and each
consecutive string independent of previous. Top node is the only one where the counts change and are never reset the others are reset at first entry to this tree symbols in it each weigh 2 for the not used symbols and for the known it weight 3 this give on first pass 2 bits for known symbol but has weight of Ee changes in many runs if could be more or less than 2 bits. In case if no new symbol and end have not occurred the weight of know symbol increases. I would start with 1 and if you have several practice files you could test other amounts. Just remember this subtree resets to the 2 2 2 3 weights at start of every new run when one symbol known.

the other 2 tables are similar if you understand what I do for first table Each has its top Ee and start with 3 to 1 to give first time use of 4 bits include which new symbol at start of subset and the counters continue to all processing done. As above the table for 2 known symbol each on first pass have to be such that 1/6 1/6 1/3 1/3 wieghts of 1 1 2 2 set at start this gives on first entry length of 2 bits for known symbol and you increase weight by one until new character or exit. One more point if you know 2 he claims that they are equally likely in which case every time through the loop update the 2 known symbols equally. Note when I say 1 1 2 2 I mean 1*n 1*n 2*n 2*n are equally likely weights and when I say oncrement by 1 I mean by n where n and m positive numbers. Note I would only play with these constants if trained by a large number of big composite sets. The fact is it unlikely that small change in what I already have would make much difference.

the last table for 3 known symbols like above a top tell Ee preloaded once and never reset so 2 bits of exit at start if string ends and 2 bits for with of 4 symbols is next starting symbol of string. This time however only 3 symbols below e part and each could have a starting weight of 1 1 1 the way it would guarantee 2 bits first time you never update this table since the auther claimed that if 3 symbols present then each has exactly the same likely hood.
I am not sure I believe he wants this since if he does a sub string 012222222222222222 would compress to the same size as 012021102120210201 if you want the
first to compress more than second then you need to update the last sub table mentioned.

The point is you have three tops nodes the new reset and are updated till whole composite set done
each time you jump to a table you reset all the values except the top nodes.  Reply With Quote

29. given the rules for a composite set of files from subsets its easy to construct a method that always gives 3 extra bits for every one character type string in large composite file. And 3 extra bits for every 2 character type of string in a large composite file. And for 3 character types you would be left with always evcess of about 2.77 bits /per subset of overhead some time 2 sometimes 3. This means it is impossible to do the 2 bits extra for every possible composite file looking 2*n where n number of subsets for every composite file. Looking at it also means its easy to limit it to exta 3 bits / subset. For every possible composite file.

I can show the proof if people want.

But if all types of file over a certain length easy to actually compress to smaller than character in the subset. counting character as 2 bits and nothing for EOSits

The thing about getting the the 2 bit limit / subset if the average character length of subset is such and such is non
sense. Its a whack a mole trade off. Using the average length. However again if you start the min size for all file type then even straight is possible.  Reply With Quote

30. ## HERE is simple constructions that is bijective

all type 1 character are 2*N + 3 bits
all type 2 character are 2*N + ( 2 or 3 ) bits
all type 3 character are 2"N + 3 bits

and yes if most string long this is not the way to do it since if most of each type long you can get lots of compression.

here is a straight simply way that shows how to code it. A way that does not use the full sapce but is easy to do
write 2 bits or what ever the first charter sen in the composite file
next if the charter repeat use 2 bits 1/4 of the space say 01 to stand for contunuation,
do that tell is ends. Write a 000 or 1/8 for spave to terminate and start over

you have 5/8 of spave left to define when a second charact appear to signal more than 2 character

you 3 use values of 1/16 to signal the next charact and fact its a 2 charact stream and will end
when you know its going to end you use the next bit as 0 to mean it stopped 0 thats 3 bits over head. 10 means first symbol again 11 means second seen symbol again and containuals 2 bits per character till end using 1/16 gives 3 extra bits for every 2 only subset you actual can use
one of the symbols chose and random to use 1/8 for full code sapce in which case 1 out 3 time you would have a 2 character type with 2 extra bits instead of 3.

them you use the values of 1/8 to indicate the 3 character and fact you will eventuall have a third symbol

so you used at this point of transiton 2/16 to terminate a type 1 charcter sting a 4/16 to say the type one went on and 1/16 3 for new charact and mark it 2 charact type and 3 2/16 to mark new character and fact 3 will occur. this 2 + 4 + 3 + 6 or 15

know you continue this 3 in character case not have seen the third character yet but dividing the spave up like this use 11 for first charcter seen and 10 for second then 00 if new character first of 2 possible or 01 if second. Know you have the first 3 character with only one extra bit. Then you use 01 for first known 10 for second known 11 for third and 00 for control end for a total of 3 3bits of over head

so all files of type 1 have 3 bits of overhead all or type 2 have 2 or 3 bits while those of type 3 always have 3

starting type woud be 0 , 1, 2, 3 would be in order 00 01 10 11
examples of type 1 1 11 111 222 333 using 01 for contine
would be in order (00 000) (01 01 000) (01 01 01 000) (01 01 01 000) (11 01 01 000)

example of two character type 2 10 11001 133 1212 23
(01 1000 0) (01 01 1001 11 11 10 0) (10 101 note 1 of 3 will have 3 bits instead of 3 for secong character 11 0) (01 1010 10 11 0) (10 00xx 0) depends on how I set the table up

example of 3 33311331222333123
11 01 01 01 "1 of 3/8 values availabe" 110 11 10 10 " 2 choice left 00 fist 01 second" 10 "now 3 characters 10 10 11 11 11 11 01 10 11 00) 3 contorl bits  Reply With Quote

31. ## now that I see its done in 3 here is the easiest way to do it and minimzes the extra bits its sort of like the last post

this where composite string in 5 symbol format converting bijectively to bits
you have the data symnols 0 to 3 and E

0 10
1 01
2 11 first character only
3 00

leave the same if 0 1 or 2 occurred 3 occurred assume 3 occured move swap it with 0
3 10
1 01
2 11
0 001
E 000 E in this case if needed is end of string

if not a 3 next leave table the same continue to new symbol or write E 000 3 control bits for substring that had only one character type when built

either you are now ended or a new character has appeared if new characte a 2 bit symbol say 1 leave the table alone
3 10
1 01 table A
2 11
0 001
E 000 now you conintue to you end writing 3 bits 000 or a new character which could be done with 2 of 3 bits

know the second charcter could have been say 0 instead of the 3 left then the table of 2 character know
where send create by 3 in this case becomes

E 00
0 10
3 01 now E to zeroes
1 111
2 110 so again we that all 2 charter sub string have three bits extra since know use 2 bits when second character cost ant extra bit.

this leave us when the third character appears with 0 or 1 control character used
and the last table

first know character 10
second know character 01
thrid know character 11
the E character is 00

so the substring of type 1 always have 3 bits
the type 2 always have 3 bits
and the type 3 can be have 2 or 3 or 4 bits

if everything random for the type 3 substring

to get the 2 bits you have 2/3 when going to second 1/2 or 1/3 chance of 2
to get the 3 bits you have 2/3 * 1/2 = 1/3 chance of 3
to get the 4 bits 1/3 * 1 or a 1/3 chance of 4 bits

the 4 bits is a payne in ass if you always want 3 or less

so going back to table A

3 10
1A 01 table x
2B 11
0C 001
E 000

there is 5/8 that needs to be redefined to get out of this 2 to 4 bits at end

C0 string will end and not got to 3 characater types 10 means 3 11 means 0 stop is 0 so 3 bits
C1 string will end and not got to 3 characater types 10 means 3 11 means 2 stop is 0 so 3 bits
A0 string will end and not got to 3 characater types 10 means 3 11 means 1 stop is 0 so 2 bits
B0 string will got to 3 charcter 10 = 3, 11 = 0 continue to 01 = 0 00,01 = 2,1 then go to next table
B1 string will got to 3 charcter 10 = 3, 11 = 2 continue to 01 = 0 00,01 = 0,1 then go to next table
B1 string will got to 3 charcter 10 = 3, 11 = 1 continue to 01 = 0 00,01 = 0,2 then go to next table

now that third character found at cost of 1 bit use 10 01 11 00 for the 3 known characters 00 is end of string so cost 3 charactes.

summary all string of 1 type of character 3 extra bits 33 would be 7 bits long
summary all string of 2 types of character 2 or 3 bits extra 32 would be 7 bits while 31 6 bis
summary all string of 3 types of character 3 bits extra

012222222 would be 21 bits
012102120 would also be 21 bits  Reply With Quote

algorithm, bit-manipulation, string 