# Thread: Is there a way to change the frequency of the letters within the text in a file?

1. ## Is there a way to change the frequency of the letters within the text in a file?

I have a text file that contains letters; each character has a frequency. Example:

Order  Symbol  Frequency
---------------------------------------------
1      A       70
2      B       62
3      E       35
4      H       32
5      D       29
6      G       28
7      F       25
8      C       19

TOTAL = 300 chars

How can I change the frequencies of the letters? For example:

Order  Symbol  Frequency
-------------------------------------
1      A       85
2      B       52
3      E       30
4      H       32
5      D       29
6      G       28
7      F       25
8      C       19

TOTAL = 300 chars

thanks
Omran Frhat

2. EDIT: I had misunderstood the question; the following examples are wrong.

Here are three examples of C functions (not tested) that change an upper-case letter to another upper-case letter.
c0 is the current character, c1 is the previous character.

Code:
```c
/* Example 1: replace each letter using an incrementing shift. */
int change_shift(int c0) {
    static int counter = 0;
    return (c0 >= 'A' && c0 <= 'Z') ? (c0 - 'A' + counter++) % 26 + 'A' : c0;
}

/* Example 2: replace each letter using the previous character c1 as the shift. */
int change_prev(int c0, int c1) {
    return (c0 >= 'A' && c0 <= 'Z') ? (c0 - 'A' + c1) % 26 + 'A' : c0;
}

/* Example 3: move-to-front. Call change_mtf(-1) once to initialize the table. */
int change_mtf(int c0) {
    static int mtf[26];
    int i, rank;
    if (c0 == -1) {
        for (i = 0; i < 26; i++) mtf[i] = 'A' + i;
        return c0;
    }
    if (c0 < 'A' || c0 > 'Z') return c0;
    /* Find the rank of c0 in the table (maybe not the best MTF implementation). */
    for (rank = 0; rank < 25 && c0 != mtf[rank]; rank++) ;
    /* Shift earlier entries down and move c0 to the front. */
    for (i = rank; i > 0; i--) mtf[i] = mtf[i - 1];
    mtf[0] = c0;
    return 'A' + rank;
}
```

3. ## Thanks:

omran farhat (29th August 2016)

4. Here's how you do it with a range coder - http://ideone.com/HZ9q6R (the method, not an actual implementation).
This method actually allows producing the specific requested frequency distribution, because the 2nd distribution is "smaller" than the 1st one:

c[n_, k_] := N[Binomial[n, k]]
(* 1st distribution *)
Log[2., c[300,70]*c[300-70,62]*c[300-70-62,35]*c[300-70-62-35,32]*c[300-70-62-35-32,29]*c[300-70-62-35-32-29,28]*c[300-70-62-35-32-29-28,25]*c[300-70-62-35-32-29-28-25,19]]
Log[2., Binomial[19,19]*Binomial[44,25]*Binomial[72,28]*Binomial[101,29]*Binomial[133,32]*Binomial[168,35]*Binomial[230,62]*Binomial[300,70]]
http://www.wolframalpha.com/input/?i...5B300,70%5D%5D
= 833.036 bits of data can be encoded via the choice of a specific permutation with the given freqs

(* 2nd distribution *)
Log[2., c[300,85]*c[300-85,52]*c[300-85-52,30]*c[300-85-52-30,32]*c[300-85-52-30-32,29]*c[300-85-52-30-32-29,28]*c[300-85-52-30-32-29-28,25]*c[300-85-52-30-32-29-28-25,19]]
Log[2., Binomial[19,19]*Binomial[44,25]*Binomial[72,28]*Binomial[101,29]*Binomial[133,32]*Binomial[163,30]*Binomial[215,52]*Binomial[300,85]]
http://www.wolframalpha.com/input/?i...5B300,85%5D%5D
= 822.44 bits

There are fewer permutations of the 2nd distribution, so it's possible to transform the 1st into the 2nd perfectly with a range coder.
But you'd lose 10.596 bits of data in that transformation, so a perfect inverse transform would be impossible -
you'd lose (833.036 - 822.44)/Log[2, StringLength["ABEHDGFC"]] = 3.532, basically 4 symbols from the first string.

5. ## Thanks:

omran farhat (29th August 2016)

6. To use a simple example, he is asking how to do a transform that would change the distribution to something (I guess) less uniform, to make the file more compressible - like "ABAB" (A=2, B=2) to "AAAB" (A=3, B=1). The answer is that you can't. The second string has lower entropy (in an order-0 model), so it would compress smaller, but the transform is lossy: you lose information. There are 6 possible strings with A=2, B=2 but only 4 with A=3, B=1.

7. ## Thanks:

omran farhat (30th August 2016)

8. Thanks Matt, that's what I'm looking for.
If you have a file with:

a = 1000 * (1) = 1000 bits
b = 800 * (2) = 1600 bits
c = 300 * (2) = 600 bits
----------------------------
total = 3200 bits = 400 bytes

and you change the frequencies:

a = 1300 * (1) = 1300 bits (a gains 300 chars)
b = 600 * (2) = 1200 bits (b loses 200 chars)
c = 200 * (2) = 400 bits (c loses 100 chars)
------------------------------
total = 2900 bits ~ 362.5 bytes

Is there a compression algorithm that works this way?

9. That's not how you count information.
There are a_freq occurrences of "a" in a string of length = total.
So from combinatorics you know that there are Binomial[total, a_freq] = total!/a_freq!/(total-a_freq)! unique placements of a_freq copies of "a" in the string.
1) Log[2., Binomial[2100,1000]*Binomial[1100,800]] = 3015.40 http://www.wolframalpha.com/input/?i...1100,800%5D%5D
2) Log[2., Binomial[2100,1300]*Binomial[800,600]] = 2651.57 http://www.wolframalpha.com/input/?i...B800,600%5D%5D
That's how many bits you need to encode each string when you know the freqs.
Arithmetic coding also makes it possible to reach these numbers almost perfectly.
But there are more combinations in (1), so you can't transform a string 2 back into every possible string 1 - that's exactly why it takes more bits to encode string 1.

Simply using arithmetic coding would be just as good as what you want, though at 3016 bits instead of 2900 :)
But remember that the freqs actually also have to be encoded :)

10. ## Thanks:

omran farhat (30th August 2016)

11. Thank you Mr. Shelwien and Dr. Matt Mahoney.
Example: we have a text file "abcabcdbeaca..."
Step 1: I use Huffman's algorithm with a variable-length code table:

a (500) 00    2 bits
b (400) 010   3 bits
c (300) 0110  4 bits
d (200) 01110 5 bits
e (100) 01111 5 bits

In the normal case the source coding looks like this:

a  b   c    a  b   c    d     b   e     a  c    a  ...
00 010 0110 00 010 0110 01110 010 01111 00 0110 00 ...

The compressed file = 4900 bits = 612.5 bytes.

Step 2: But before completing the encoding process, we want to change the frequency of a source symbol.
Step 3: Change the frequencies. For example, you get the following:

a (650) 00
b (300) 010
c (250) 0110
d (200) 01110
e (100) 01111

Step 4: Now encode the source symbols, and record in the header which characters have been changed:

a  b   a  a  b   c    d     a  e     a  a  a  ...
00 010 00 00 010 0110 01110 00 01111 00 00 00 ...

= 4700 bits = 587.5 bytes

With Huffman's algorithm alone we got 4900 bits:

a  b   (c)    a  b   c    d     (b)   e     a  (c)    a
a  b   (a)    a  b   c    d     (a)   e     a  (a)    a
00 010 (0110) 00 010 0110 01110 (010) 01111 00 (0110) 00
00 010 (00)   00 010 0110 01110 (00)  01111 00 (00)   00

If we change the frequencies, we can probably get 4700 bits and save 25 bytes.
This idea was by
omran farhat

12. Originally Posted by omran farhat
Is there a compression algorithm that works this way?
No. As I explained, changing character frequencies loses information, so it is not possible to restore the original text.

13. ## Thanks:

omran farhat (22nd September 2016)

14. Each character is unique. As Matt Mahoney said, if you change the frequencies you mess up the original information.
The frequencies themselves store some information about the content. One can think philosophically about what this frequency information actually means for the whole.
If you want to make one symbol out of other symbols (in your example, the text file "HDADGABHAEABABEAEGEDEBHE..."), reading it as pairs HD AD GA BH AE AB AB EA EG ED EB HE instead of single letters, you get something like:
HD=1, AD=1, GA=1, BH=1, AE=1, AB=2, EA=1, EG=1, ED=1, EB=1, HE=1, ...

It's better to do this bitwise (instead of bytewise), though, since the frequency information stays closer to the original if you change the bit length.
You can then make a table (or a dictionary of some sort) of all the pair instances, apply a reversible sorting algorithm, and exclude the pairs that do not exist, to save space.
Depending on the bit length of each symbol pair you are able to reduce the frequency information, but you increase the complexity of the symbol lookups.
Whether or not one is able to compress with this method, I don't know; how to store and decode information about this complexity is not on my list to figure out right now.
What you are able to achieve with this method is at least some kind of transformation of how the data is interpreted.

15. ## Thanks:

omran farhat (26th October 2016)
