# Thread: TOP4 algorithm

1. Hmm. At first I thought it might be the header, but the result output is about half of what the demo program result is. I'll have to look at it again in the morning and see if there's something missed. Your code looks right for the counts, but I'll recompile and see if I can determine where the difference is.

Originally Posted by jibz
I probably missed something in your description of the algorithm (or, equally likely I got the code below wrong), but I can't seem to get the numbers to add up. I wrote a small program that counts symbol frequencies, sorts them, and computes the encoded length of assigning 7 bits to the highest 4 and 9 bits to the lowest 8 for the file sample.mp3:

```#include <algorithm>
#include <functional>
#include <numeric>
#include <iostream>
#include <vector>

#include <stdio.h>
#include <stdlib.h>

int main()
{
std::vector<size_t> count(256, 0);
FILE *fp = fopen("sample.mp3", "rb");
int ch;

while ((ch = fgetc(fp)) != EOF) {
count[ch]++;
}

std::sort(count.begin(), count.end(), std::greater<>());

std::cout << std::accumulate(count.begin(), count.end(), 0) << " input bytes\n";

size_t out_bits = 0;

for (int i = 0; i < 256; ++i) {
if (i < 4) {
out_bits += 7 * count[i];
}
else if (i < 248) {
out_bits += 8 * count[i];
}
else {
out_bits += 9 * count[i];
}
}

std::cout << (out_bits + 7) / 8 << " output bytes\n";
}
```

Output is:

Code:
```4829825 input bytes
4816362 output bytes```

2. Ok, I see what happened. I have 3 versions of this I was working on, and may have compiled one of the other two with the modifications still there. What you're getting is closest to the basic algorithm without the header added. The results from the compiled version are reflecting one if not two additional modifications I meant to pull out before compiling, but you can implement them easily to match or get similar if not the same results.

You would need to expand one more of the 8 bit codes from the table to 9 bits by adding an extra 0 and 1 creating 10 patterns of 9 rather than 8, starting at -1 position of 246 to 255 instead of 247 to 255 :

for (int i = 0; i < 256; ++i) { if (i < 4) {

out_bits += 7 * count[i];

}

else if (i < 247) {

out_bits += 8 * count[i];

}

else {

out_bits += 9 * count[i];

}

}

The extra two patterns can then be used for an eof symbol and header, or they can be used to do 1 or two modifications. 2 of the modifications made (one that I'm seeing closer results to with the output) is to group ascii character (0)+(0) together to compress each occurrence to 4.5 bits. The other is to use the other 9 bit code to represent the last character seen twice if detected from the stream:

i.e:

-------

Defined variables:
Last, Next, and AnotherByte are Strings

Last = last character

Read NextSymbol as a byte

If NextSymbol = Last, then read AnotherByte

If Another = Last, then output 9 bit code for Last + Last (a match of 2 symbols has been found and compresses to 1 pattern of 9 bits)

If Another <> Last, then output code for Another and Last separately where they are on the table (no match found; represent each symbol with what they are on the table)

-------

Try adding one or both of those modifications to your code and see if you get closer or same to the results. I was experimenting with this and a few other modifications (greedy bit tailoring after converting the symbols to binary as well) but I may have included the compiled version for one or both of the above modifications with the demo when building the windows exe version. You'll still get the gains without them after post-LZ processing, but not as much as with them present.

3. Originally Posted by JamesWasil

Try adding one or both of those modifications to your code and see if you get closer or same to the results. I was experimenting with this and a few other modifications (greedy bit tailoring after converting the symbols to binary as well) but I may have included the compiled version for one or both of the above modifications with the demo when building the windows exe version. You'll still get the gains without them after post-LZ processing, but not as much as with them present.
I think these additions are part of the trouble I had with your statements about TOP4 - the way you described it appeared to be coding one symbol at a time, based on the byte frequencies, and yet somehow exceed what a Huffman code would do under _the same conditions_. The additional modifications clearly don't just code each symbol individually, you introduce some sort of context (last symbol, paired 0s, etc). Creating a Huffman code with an alphabet that can also code for the last symbol and for (0)+(0), etc, would then be a fairer comparison against the perceived issues of Huffman.

4. ## Thanks (2):

JamesWasil (10th October 2018),jibz (10th October 2018)

5. Originally Posted by Stefan Atev
I think these additions are part of the trouble I had with your statements about TOP4 - the way you described it appeared to be coding one symbol at a time, based on the byte frequencies, and yet somehow exceed what a Huffman code would do under _the same conditions_. The additional modifications clearly don't just code each symbol individually, you introduce some sort of context (last symbol, paired 0s, etc). Creating a Huffman code with an alphabet that can also code for the last symbol and for (0)+(0), etc, would then be a fairer comparison against the perceived issues of Huffman.
Understandable. Even without the minimal context modeling, it will still exceed Huffman without these changes, but only if the frequencies are too narrow and random for powers of 2 to effectively represent the optimal bit patterns. If you code them individually, you'll still get marginal gains over Huffman without them but only about 1.5%, but usually only after a dictionary algorithm of some kind has passed over it first.

If the data is still redundant, then you'll get more from Huffman if you're only using Huffman. If you use it as a final pass compressor on precompressed or non-redundant data, that's where you'll see an improvement, and that improvement will be there with or without the use of Huffman (i.e: whether you use huffman + lz77 or just lz77 for example, you can still use top4 afterward in most cases on the output as a final pass compressor to do better than huffman used a second time on that same data).

Huffman does better when the data is still mostly redundant, Top4 does better when the data is no longer redundant enough. From that point on, if the data expands, it'll expand less with Top4 than with Huffman. If it is still compressible, it'll compress more with Top4 than with Huffman at the end. In some cases where Huffman expands it slightly, Top4 will still compress it slightly instead.

If you only use the basic method without any additions, you'll get results close to what jibz had plus or minus space for the header. If you use 0+0 and last byte x 2, you'll get the results that the demo compile does. To get more than that, you can grab the highest single symbols and (to a point) represent them as 4.5 bps with the extended 9 bit codes and transforming the least of the 8 bit codes to 2 9 bit codes each if the statistics at the bottom stay shorter than those above it.

You could include other adaptations to get more, but most of those gains are already made by a good dictionary or BWT algorithm ahead of it.

6. Hopefully this will still be useful for some people with projects. Due to the patent issues on a few arithmetic implementations still and ease of use to make in any language with a few lines of code, I hope that it will be.

I see people mostly using dictionary + arithmetic these days, but anyone who still uses huffman can replace it with this code at the end stage to get 1.5% or more savings.

As Jyrki pointed out, unless it's fully optimized there's going to be a speed hit, since it still needs to gather statistics and then assign variable length codes like other statistical coders do. If the speed hit is the same as an arith binary coder, then you might want to use that to get the better results. If it's less of a speed hit, then top4 will give you more savings than huffman but less than arithmetic there. If it dosen't matter and is speed-based, then huffman or output can be left as-is after the final pass.

7. Originally Posted by JamesWasil
Yes, you need to stop there. You're the one who was wrong on the internet. I've had decades of research and programming since 1986, but thank you. Hope you got some sleep. zzz
Usually I would not have replied here, because I ignore personal accusations. Let me explain, why I still do in this case. JamesWasil can safely ignore this reply, since they already made their opion and no longer want to learn.

The 'encode.su' forum is probably read by thousands of users world wide. Some of them might be experts, who answer questions about existing algorithm, or (if they are genius) even invent new algorithms. Many other readers, however, come here to learn, and the Huffman algorithm, while pretty old and simple, might still be new to many readers.

As a (math and computer science) teacher, I am both interested to understand old algorithms, as well as interested in any new algorithms that others may have invented. Neither would I claim that I am a genius to invent something new, nor would I claim that I understand every old algorithm. For example, I am very sure I still have not understood the ANS family of algorithms in a way that I could write code for these without help.

But what I am also interested in is that an open forum such as 'encode.su' presents only factual correct information, and if it is wrong, that someone corrects it.

Someone said:

"Once your frequencies become nearly even, at that point, if you run Huffman instead of Top4, Huffman is going to generate the shortest codes it can and it may even give you 6 bit codes rather than 7 to start with..."

This is factually wrong, and I add a simple C program that runs the Huffman algorithm for a "nearly even" frequency distribution.

It first outputs a sample frequency distribution for 256 symbols, here called '00' ... 'FF'. Then it outputs its decisions to build a tree by merging the two symbols with the least frequency. Finally, it outputs the number of bits needed according to the Huffman tree.

The result is that there are two codes assigned a length of 7 bits, four codes assigned a length of 9 bits, and the remaining 250 have a length of 8 bits. Making the distribution even more flat increases the number of symbols with 8 bits, up to a point where all 256 symbols are assigned a length of 8 bits, causing no expansion. This can be demonstrated by making the constant '500' larger.

Only if the distribution gets more distorted, Huffman increases the number of 7 bit symbols, up to a point where it even decides to use shorter (6 bit or less) codes. It does it based on the frequencies to minimize the total code length. This can be demonstrated by making the constant '500' smaller.

Code:
```#include <stdio.h>
#include <stdlib.h>

// maximum number of symbols
#define N 256

// instead of a tree, we store new nodes in the second half of our arrays
int freq[2 * N];
int parent[2 * N];

void huffman()
{
int symbols = 0;
int i;

// initialize all nodes to have no parent
for (i = 0; i < 2 * N; ++i) {
parent[i] = -1;
}

// count actual number of symbols
for (i = 0; i < N; ++i) {
if (freq[i] > 0) ++symbols;
}

// index of next new node
int next = N;

// as long as we have at least two remaining symbols, apply the Huffman algorithm
while (symbols >= 2) {
// STEP 1: find two elements with the smallest frequency > 0
int i1 = -1;
int i2 = -1;
for (i = 0; i < next; ++i) {
if (freq[i] > 0) {
if (i1 < 0) {
i1 = i;
} else if (i2 < 0) {
i2 = i;
} else {
if (freq[i1] < freq[i2]) {
if (freq[i] < freq[i2]) {
i2 = i1;
i1 = i;
}
} else {
if (freq[i] < freq[i1]) {
i1 = i2;
i2 = i;
}
}
}
}
}
printf("two smallest: freq[%d]=%d and freq[%d]=%d\n", i1, freq[i1], i2, freq[i2]);

// STEP 2: create a new node with cumulated frequency
freq[next] = freq[i1] + freq[i2];
parent[i1] = next;
parent[i2] = next;
// make sure we do not find these again
freq[i1] = -1;
freq[i2] = -1;
--symbols;
++next;
}
}

int main(int argc, char *argv[])
{
int i;

for (i = 0; i < 256; ++i) {
freq[i] = 500 - i;
printf("%02X: frequency=%d\n", i, freq[i]);
}

huffman();

for (i = 0; i < N; ++i) {
int nbits = 0;
int j = i;
// count nodes up to root
while (parent[j] >= 0) {
j = parent[j];
++nbits;
}
if (nbits > 0) {
printf("%02X: bits=%d\n", i, nbits);
}
}

return 0;
}```

8. ## Thanks (2):

jibz (12th October 2018),Stefan Atev (12th October 2018)

9. As I said before, you need to just stop. I refuse to maintain baseless criticism as valid from someone who claims to be a teacher in academia, yet whose only product or contribution to the community thus far was a program not much better than PNG for images containing stolen JPEG2000 code (i.e: imagezero). I was trying to avoid any mention of that or pointing that out to be kind to you when I asked you to stop while you were ahead.

I am always interested to learn from others and contribute wherever and whenever possible. As your response illustrated, no one is perfect, and even the most experienced are inclined to learn from a variety of sources and levels throughout the communities that exist both online and in the real world. As long as we never lose sight of that, we all grow together through our contributions but remain stagnant if we place criticisms before our understanding.

Your understanding of my algorithm was flawed, as was your comparison of it to Huffman when you did not understand how it worked or why it worked better than Huffman when given constraints of redundancy elimination following post-processing.

Let's get back to what we're really here for, data compression and application of it!

Yes, that is how Huffman works. But there are static Huffman generations and modified trees which do generate 6 bits to start instead of 7, at least by 1. One such sample output generated from a tree that began with these codelengths is as follows:

0000111111, 10
0100011111, 10
0100111111, 10
0110011111, 10
0110111111, 10
1000011111, 10
1000111111, 10
1010011111, 10
1010111111, 10
1011111111, 10
1011011111, 10
1100011111, 10
1100111111, 10
1101111111, 10
1101011111, 10
1111111111, 10
1111011111, 10
1110111111, 10
1110011111, 10
0000111110, 10
0100011110, 10
0100111110, 10
0110011110, 10
0110111110, 10
1000011110, 10
1000111110, 10
1010011110, 10
1010111110, 10
1011111110, 10
1011011110, 10
1100011110, 10
1100111110, 10
1101111110, 10
1101011110, 10
1111111110, 10
1111011110, 10
1110111110, 10
1110011110, 10
000111111, 9
001111111, 9
001011111, 9
010111111, 9
011111111, 9
100111111, 9
000111110, 9
001111110, 9
001011110, 9
010111110, 9
011111110, 9
100111110, 9
000011110, 9
010001110, 9
010011110, 9
011001110, 9
011011110, 9
100001110, 9
100011110, 9
101001110, 9
101011110, 9
101111110, 9
101101110, 9
110001110, 9
110011110, 9
110111110, 9
110101110, 9
111111110, 9
111101110, 9
111011110, 9
111001110, 9
00000111, 8
00000110, 8
00000101, 8
00000100, 8
00011110, 8
00111110, 8
00101110, 8
01011110, 8
01111110, 8
10011110, 8
00001110, 8
01000110, 8
01001110, 8
01100110, 8
01101110, 8
10000110, 8
10001110, 8
10100110, 8
10101110, 8
10111110, 8
10110110, 8
11000110, 8
11001110, 8
11011110, 8
11010110, 8
11111110, 8
11110110, 8
11101110, 8
11100110, 8
00001101, 8
01000101, 8
01001101, 8
01100101, 8
01101101, 8
10000101, 8
10001101, 8
10100101, 8
10101101, 8
10111101, 8
10110101, 8
11000101, 8
11001101, 8
11011101, 8
11010101, 8
11111101, 8
11110101, 8
11101101, 8
11100101, 8
00001100, 8
01000100, 8
01001100, 8
01100100, 8
01101100, 8
10000100, 8
10001100, 8
10100100, 8
10101100, 8
10111100, 8
10110100, 8
11000100, 8
11001100, 8
11011100, 8
11010100, 8
11111100, 8
11110100, 8
11101100, 8
11100100, 8
00001011, 8
01000011, 8
01001011, 8
01100011, 8
01101011, 8
10000011, 8
10001011, 8
10100011, 8
10101011, 8
10111011, 8
10110011, 8
11000011, 8
11001011, 8
11011011, 8
11010011, 8
11111011, 8
11110011, 8
11101011, 8
11100011, 8
00001010, 8
01000010, 8
01001010, 8
01100010, 8
01101010, 8
10000010, 8
10001010, 8
10100010, 8
10101010, 8
10111010, 8
10110010, 8
11000010, 8
11001010, 8
11011010, 8
11010010, 8
11111010, 8
11110010, 8
11101010, 8
11100010, 8
00001001, 8
01000001, 8
01001001, 8
01100001, 8
01101001, 8
10000001, 8
10001001, 8
10100001, 8
10101001, 8
10111001, 8
10110001, 8
11000001, 8
11001001, 8
11011001, 8
11010001, 8
11111001, 8
11110001, 8
11101001, 8
11100001, 8
00001000, 8
01000000, 8
01001000, 8
01100000, 8
01101000, 8
10000000, 8
10001000, 8
10100000, 8
10101000, 8
10111000, 8
10110000, 8
11000000, 8
11001000, 8
11011000, 8
11010000, 8
11111000, 8
11110000, 8
11101000, 8
11100000, 8
0001110, 7
0011110, 7
0010110, 7
0101110, 7
0111110, 7
1001110, 7
0001101, 7
0011101, 7
0010101, 7
0101101, 7
0111101, 7
1001101, 7
0001100, 7
0011100, 7
0010100, 7
0101100, 7
0111100, 7
1001100, 7
0001011, 7
0011011, 7
0010011, 7
0101011, 7
0111011, 7
1001011, 7
0001010, 7
0011010, 7
0010010, 7
0101010, 7
0111010, 7
1001010, 7
0001001, 7
0011001, 7
0010001, 7
0101001, 7
0111001, 7
1001001, 7
0001000, 7
0011000, 7
0010000, 7
0101000, 7
0111000, 7
1001000, 7
000000, 6

It did not have to change to 6 bits from 7 for the above example because it already started with it for a read data set which can happen, as I mentioned earlier.

Unfortunately Top4 has to start with 7 bits (you could make it go for 6, but I didn't see much improvement by making the first 6 and additional 9 at the bottom or 2 of 10 for the last 2 symbols) but it keeps the output codes shorter staying at 7 than what Huffman is going to adapt to when it realizes it needs to produce shorter codewords for the remainder of the data that it is reading from a file or a stream.

Top4 doesn't change those, and (with or without the modifications to gain more) will give you a percentage of patterns shorter than huffman when it encounters those, yielding additional compression when those bits are compressed out each time.

Huffman is going to be optimal when your data aligns to distributions that are a power of 2 and when your data is redundant, but when the distribution it encounters is too flat, you are able to get a few more bit patterns to compress than what is algorithmically done by Huffman by having only a slight reduction at the top of 1 bit for every 2 grown at the bottom, which is what top4 is.

You won't get Huffman to do that with any implementation (be it yours above or an adaptive one) because even with the normally optimal binary trees, it isn't able to see half-states where having 1 bit pattern shorter and 2 larger statistically will make it compress more.

Top4 gets those because it stays as close to flat as it can while still having a way to compress the most common differential of the space left between those counts.

If Huffman were able to see those, then jibz results and my demo code would not have produced the optimal results they did over huffman with files that are more flat.

Originally Posted by cfeck
Usually I would not have replied here, because I ignore personal accusations. Let me explain, why I still do in this case. JamesWasil can safely ignore this reply, since they already made their opion and no longer want to learn.

The 'encode.su' forum is probably read by thousands of users world wide. Some of them might be experts, who answer questions about existing algorithm, or (if they are genius) even invent new algorithms. Many other readers, however, come here to learn, and the Huffman algorithm, while pretty old and simple, might still be new to many readers.

As a (math and computer science) teacher, I am both interested to understand old algorithms, as well as interested in any new algorithms that others may have invented. Neither would I claim that I am a genius to invent something new, nor would I claim that I understand every old algorithm. For example, I am very sure I still have not understood the ANS family of algorithms in a way that I could write code for these without help.

But what I am also interested in is that an open forum such as 'encode.su' presents only factual correct information, and if it is wrong, that someone corrects it.

Someone said:

"Once your frequencies become nearly even, at that point, if you run Huffman instead of Top4, Huffman is going to generate the shortest codes it can and it may even give you 6 bit codes rather than 7 to start with..."

This is factually wrong, and I add a simple C program that runs the Huffman algorithm for a "nearly even" frequency distribution.

It first outputs a sample frequency distribution for 256 symbols, here called '00' ... 'FF'. Then it outputs its decisions to build a tree by merging the two symbols with the least frequency. Finally, it outputs the number of bits needed according to the Huffman tree.

The result is that there are two codes assigned a length of 7 bits, four codes assigned a length of 9 bits, and the remaining 250 have a length of 8 bits. Making the distribution even more flat increases the number of symbols with 8 bits, up to a point where all 256 symbols are assigned a length of 8 bits, causing no expansion. This can be demonstrated by making the constant '500' larger.

Only if the distribution gets more distorted, Huffman increases the number of 7 bit symbols, up to a point where it even decides to use shorter (6 bit or less) codes. It does it based on the frequencies to minimize the total code length. This can be demonstrated by making the constant '500' smaller.

Code:
```#include <stdio.h>
#include <stdlib.h>

// maximum number of symbols
#define N 256

// instead of a tree, we store new nodes in the second half of our arrays
int freq[2 * N];
int parent[2 * N];

void huffman()
{
int symbols = 0;
int i;

// initialize all nodes to have no parent
for (i = 0; i < 2 * N; ++i) {
parent[i] = -1;
}

// count actual number of symbols
for (i = 0; i < N; ++i) {
if (freq[i] > 0) ++symbols;
}

// index of next new node
int next = N;

// as long as we have at least two remaining symbols, apply the Huffman algorithm
while (symbols >= 2) {
// STEP 1: find two elements with the smallest frequency > 0
int i1 = -1;
int i2 = -1;
for (i = 0; i < next; ++i) {
if (freq[i] > 0) {
if (i1 < 0) {
i1 = i;
} else if (i2 < 0) {
i2 = i;
} else {
if (freq[i1] < freq[i2]) {
if (freq[i] < freq[i2]) {
i2 = i1;
i1 = i;
}
} else {
if (freq[i] < freq[i1]) {
i1 = i2;
i2 = i;
}
}
}
}
}
printf("two smallest: freq[%d]=%d and freq[%d]=%d\n", i1, freq[i1], i2, freq[i2]);

// STEP 2: create a new node with cumulated frequency
freq[next] = freq[i1] + freq[i2];
parent[i1] = next;
parent[i2] = next;
// make sure we do not find these again
freq[i1] = -1;
freq[i2] = -1;
--symbols;
++next;
}
}

int main(int argc, char *argv[])
{
int i;

for (i = 0; i < 256; ++i) {
freq[i] = 500 - i;
printf("%02X: frequency=%d\n", i, freq[i]);
}

huffman();

for (i = 0; i < N; ++i) {
int nbits = 0;
int j = i;
// count nodes up to root
while (parent[j] >= 0) {
j = parent[j];
++nbits;
}
if (nbits > 0) {
printf("%02X: bits=%d\n", i, nbits);
}
}

return 0;
}```

10. Originally Posted by JamesWasil
If Huffman were able to see those, then jibz results and my demo code would not have produced the optimal results they did over huffman with files that are more flat.
Can I just point out that Huffman does better than the result of my calculation for the basic top4 algorithm on the file sample.mp3. The result of top4 was 4,816,362 and basic Huffman coding (no stop symbol) gives 4,811,584. This is because Huffman assigns one 6 bit code to 00.

If I modify the frequency of 00 to be like the other codes assigned 7 bits, then Huffman produces exactly the same code lengths as top4 (4 length 7 and 8 length 9), and the output file is consequently 4,816,362 exactly like top4. If the distribution were entirely flat, then Huffman would assign 8 bits to each symbol and break even.

The potential benefit of top4 could be that it requires very little setup and has a very simple table and decode step. But as far as my understanding of the basic version goes, it does not beat Huffman, simply because given a frequency distribution that would favor top4, Huffman creates the same code lengths. The extensions you mention may of course, if they do more than symbolwise independent coding.

11. ## Thanks:

JamesWasil (13th October 2018)

12. Created top4 in Visual Basic .NET (Framework 4.7.2)

Syntax:
top4 e input output
top4 d input output

enwik8 100,000,000 bytes:
95,826,517 bytes, 158.582 sec., - 76.268 sec.

enwik9 1,000,000,000 bytes:
958,856,207 bytes, 1,571.818 sec. - 759.162 sec.

13. ## Thanks (2):

JamesWasil (13th October 2018),moisesmcardona (14th October 2018)

14. Here is the source code to one of the other original versions I did of this that is faster and has a few other slight changes to it. Releasing as public domain.

This one is in QB64 / Quickbasic 64 which is a 64 bit version of Basic that compiles for Mac, Windows, and Linux. It will also compile and run on Chromebook and (possibly) traditional Android devices.

The first version was done in pascal/delphi, but I doubt anyone will use that one. This can be easily ported to VB6 by adding a form and replacing write/locate with .caption or .text on a textbox or Label to update data positions.

It should be easy to convert this to Python, PHP, or any other language you fancy.

TOP4 COMPRESS:
Code:
```'TOP4 COMPRESS BASIC/Quickbasic/QB64 implementation by James Wasil 2018
'TOP4 is a compression method I came up with that can replace Huffman for some post-LZ and other final pass compression methods. It can be used that way or standalone.
'It's easy to use, easy to implement, easy to convert to other programming languages, straightforward, and treeless.
'This version will work with the 32 or 64 bit version of the QB64 compiler for Windows, Mac OS X, and Linux
'freely available at http://www.qb64.net or http://www.qb64.org
'It can work with classic Quickbasic or Qbasic from DOSBOX.

'Top4 can be made to work with Visual Basic 6.0 or higher by adding a form and placing this code in the main. The locate/write visual updates may be replaced with
'Caption changes to a .Text of a TextBox if you like. The rest should work as-is.

'This is released as public domain to use, improve upon, or do as you like with it for data compression.
'If you find this helpful or useful, please give credit or leave my name in the code. That's all I ask. :)
'Feel free to send feedback to: james.wasil@gmail.com
'Enjoy!

'
'Array definitions:
'P1\$() =   ASCII symbols from 0 to 255
'P2()  =   Numeric array that holds count of symbols
'P3\$() =   Binary string array from 0 to 255 that holds variable-sized bit patterns
'BUFFER\$ = Temporary file buffer of up to 32767 bytes
'Z2\$ =     Working string buffer of data read from BUFFER\$ out of the file
'OUTP\$ =   Output string for binary patterns. These get sent to the disk after there are enough to convert to an 8 bit byte.

DIM P1\$(256), P2(256), P3\$(256): P = P - 1
FOR T = 0 TO 3: P = P + 1: TEMP\$ = CHR\$(T): CALL Ascii.To.Bynary(TEMP\$, OUT1\$): P3\$(P) = RIGHT\$(OUT1\$, 7): OUT1\$ = "": NEXT T: 'ADD FIRST 4 PATTERNS THAT ARE 7 BITS TO P3\$(P), WHERE P IS ALWAYS +1.
FOR T = 8 TO 249: P = P + 1: TEMP\$ = CHR\$(T): CALL Ascii.To.Bynary(TEMP\$, OUT1\$): P3\$(P) = OUT1\$: OUT1\$ = "": NEXT T: 'ADD NEXT 248 PATTERNS THAT ARE 8 BITS TO P3\$(P), WHERE P IS ALWAYS +1.
FOR T = 250 TO 254: P = P + 1: TEMP\$ = CHR\$(T): CALL Ascii.To.Bynary(TEMP\$, OUT1\$): P3\$(P) = OUT1\$ + "0": P = P + 1: P3\$(P) = OUT1\$ + "1": OUT1\$ = "": NEXT T: 'ADD LAST 8 PATTERNS THAT ARE 9 BITS TO P3\$(P), WHERE P IS ALWAYS +1.
'SLIGHTLY MODIFIED FROM ORIGINAL TOP4 PATTERN TO MAKE SPACE FOR AN EOF SYMBOL.
'TEMP\$=CHR\$(255): P=P+1: CALL Ascii.To.Bynary(TEMP\$, OUT1\$): P3\$(P) = OUT1\$ + "0": P = P + 1: P3\$(P) = OUT1\$ + "1": OUT1\$ = "": 'ADD STOP/EOF SYMBOL AND WILDCARD\$

FOR T = 0 TO 255: P1\$(T) = CHR\$(T): NEXT T: 'CREATE STANDARD ASCII TABLE

LINE INPUT "File to read:", FILE1\$     :'FILE TO READ FROM
LINE INPUT "File to write to:", FILE2\$ :'FILE TO OUTPUT TO
OPEN FILE1\$ FOR BINARY AS #1
OPEN FILE2\$ FOR OUTPUT AS #2: CLOSE #2: OPEN FILE2\$ FOR BINARY AS #2

DO
IF LEN(Z2\$)<=1 THEN
DO
IF LOF(1)-LOC(1)=>32767 THEN BUFFER\$=STRING\$(32767,"0") ELSE BUFFER\$=STRING\$(LOF(1)-LOC(1),"0")
IF LOF(1)-LOC(1)=>0 THEN GET #1,,BUFFER\$:Z2\$=Z2\$+BUFFER\$
LOOP UNTIL LOF(1)-LOC(1)=<0 OR LEN(Z2\$)>=1
ELSE
END IF

Z\$ = LEFT\$(Z2\$,1):Z2\$=RIGHT\$(Z2\$,LEN(Z2\$)-1)
IF LEN(Z\$)=>1 THEN P2(ASC(Z\$)) = P2(ASC(Z\$)) + 1
LOCATE 1, 1: WRITE LOF(1) - LOC(1)
LOOP UNTIL Z\$=""

DO
FOUND = 0
FOR T = 0 TO 254
IF P2(T) < P2(T + 1) THEN TEMP1 = P2(T): TEMP2\$ = P1\$(T): P2(T) = P2(T + 1): P2(T + 1) = TEMP1: P1\$(T) = P1\$(T + 1): P1\$(T + 1) = TEMP2\$: FOUND = 1
NEXT T
LOOP UNTIL FOUND = 0

'MAKE STATIC BYTE HEADER HERE. NOW THAT IT IS ARRANGED STATISTICALLY, IT CORRESPONDS FROM GREATEST TO LEAST OCCURENCES WITH THE P3\$() ARRAY BIT PATTERNS.
for T=0 to 255:HEADER\$=HEADER\$+P1\$(T):NEXT T:PUT #2,,HEADER\$:' This adds an easy to restore 256 byte header. It is possible to reduce this to 12 bytes and build around it to restore.

'ADD THE FILESIZE NEXT. The file size is added to the second part of the header after the 256 byte table. For easy implementation, file size digits are stored as raw
'8 bit bytes. The file size can be any size this way, but if you want to save space this can be a static 4 byte file header to support up to 4gb, even though the overhead isn't
'that large with this dynamic size header, it may make a difference of a few bytes for smaller files:

FILESIZE\$=LTRIM\$(STR\$(LOF(1)))+"E":'E FOR END
PUT #2,,FILESIZE\$

'Set the file pointer to the first position. Z\$ is the same as BUFFER\$ here, but needs to be initialized with any symbol of at least 1 byte. It can be up to 32767 bytes or more, but gets slower if too large
'of a string is used.

SEEK #1, 1: Z\$ = "P"

CLS

'Get up to 32767 bytes, use Z3\$ as a BUFFER\$, add the new bytes to Z2\$

DO
IF LOF(1)-LOC(1)=>32767 AND LEN(Z2\$)<2 THEN Z3\$ = STRING\$(32767,"0")
IF LOF(1)-LOC(1)<32767 AND LEN(Z2\$)<2 THEN Z3\$="0"
IF LEN(Z2\$)<2 AND LOF(1)-LOC(1)=>1 THEN
DO
GET #1, , Z3\$:Z2\$=Z2\$+Z3\$
LOOP UNTIL LEN(Z2\$)=>1 OR LOF(1)-LOC(1)<1
END IF

LOCATE 1, 1: WRITE LOC(1):WRITE LOF(2):'Write the position we're at from FILE1\$ and the size of the new FILE2\$ and update it. Compression goes slightly faster without this visual update.

'A few modifications:

'Modification 1: If ASCII symbol 0+0 is seen, then compress each to 4.5 bits and group the occurence together as 1 9 bit pattern.
IF LEFT\$(Z2\$,2)=CHR\$(0)+CHR\$(0) THEN
OUTP\$=OUTP\$+"111111110"
Z2\$=RIGHT\$(Z2\$,LEN(Z2\$)-2):'Remove the 2 bytes from the input file stream after we've output the binary result

'Modification 2: If we see the last symbol output appear as the next 2 symbols, we are able to group those 2 symbols together as 4.5 bits and output as 1 9 bit pattern:
ELSEIF Z2\$=LAST\$+LAST\$ THEN
OUTP\$=OUTP\$+"111111111"
Z2\$=RIGHT\$(Z2\$,LEN(Z2\$)-2):'Remove the 2 bytes from the input file stream after we've output the binary result

'Proceed with normal ASCII to binary lookup table here:
ELSE
Z\$=LEFT\$(Z2\$,1):Z2\$=RIGHT\$(Z2\$,LEN(Z2\$)-1)
FOR PP = 0 TO 255: IF P1\$(PP) = Z\$ THEN OUTP\$ = OUTP\$ + P3\$(PP):LAST\$=Z\$:Z\$="":EXIT FOR: 'ADD THE BINARY REFERENCE PATTERN BY ASCII VALUE, BASED ON THE STATISTICAL ARRANGEMENT OF THE SYMBOLS HERE.
NEXT PP
END IF

'If we're not at the end of the file yet
IF LOF(1)-LOC(1)=<0 THEN

'See if we are able to output a byte to the disk from OUTP\$. If it's at least 8 bits, convert to a byte and send it out:
IF LEN(OUTP\$) >= 8 THEN TEMP\$ = LEFT\$(OUTP\$, 8): OUTP\$ = RIGHT\$(OUTP\$, LEN(OUTP\$) - 8): CALL Bynary.To.Ascii(TEMP\$, OUTP2\$): PUT #2, , OUTP2\$: OUTP2\$ = ""
EXIT DO
ELSE
END IF

'IF WE HAVE 8 BITS, THEN OUTPUT A BYTE FROM THE LEFT. SAME AS ABOVE, BUT OUTSIDE OF THE MAIN LOOP:
IF LEN(OUTP\$) >= 8 THEN
DO
TEMP\$ = LEFT\$(OUTP\$, 8): OUTP\$ = RIGHT\$(OUTP\$, LEN(OUTP\$) - 8): CALL Bynary.To.Ascii(TEMP\$, OUTP2\$): PUT #2, , OUTP2\$: OUTP2\$ = ""
LOOP UNTIL LEN(OUTP\$)<8
END IF
LOOP UNTIL LOF(1)-LOC(1)=<0 AND Z2\$="" AND LEN(OUTP\$)<8

'OUTSIDE OF THE LOOP AND READY TO FINISH:
'IF THERE ARE STILL BITS PRESENT WITH OUTP\$, THEN PAD IT TO 8 BITS WITH EXTRA ZEROS THEN OUTPUT THE LAST BYTE.
IF LEN(OUTP\$) <> 0 THEN OUTP\$ = OUTP\$ + STRING\$(8 - LEN(OUTP\$), "0"): OUTP\$ = RIGHT\$(OUTP\$, LEN(OUTP\$) - 8): CALL Bynary.To.Ascii(TEMP\$, OUTP2\$): PUT #2, , OUTP2\$: OUTP2\$ = ""
WRITE "Original File:", LOF(1)
WRITE "Compressed Output:", LOF(2)
WRITE "Difference:", LOF(1) - LOF(2)
CLOSE #1
CLOSE #2
END

'Functions / Subs area.
SUB Ascii.To.Bynary (X\$, OUTZX\$)
FOR K = 1 TO LEN(X\$)
KXZZ# = ASC(MID\$(X\$, K, 1))
IF KXZZ# - 128 >= 0 THEN KXZZ# = KXZZ# - 128: OUTZX\$ = OUTZX\$ + "1" ELSE OUTZX\$ = OUTZX\$ + "0"
IF KXZZ# - 64 >= 0 THEN KXZZ# = KXZZ# - 64: OUTZX\$ = OUTZX\$ + "1" ELSE OUTZX\$ = OUTZX\$ + "0"
IF KXZZ# - 32 >= 0 THEN KXZZ# = KXZZ# - 32: OUTZX\$ = OUTZX\$ + "1" ELSE OUTZX\$ = OUTZX\$ + "0"
IF KXZZ# - 16 >= 0 THEN KXZZ# = KXZZ# - 16: OUTZX\$ = OUTZX\$ + "1" ELSE OUTZX\$ = OUTZX\$ + "0"
IF KXZZ# - 8 >= 0 THEN KXZZ# = KXZZ# - 8: OUTZX\$ = OUTZX\$ + "1" ELSE OUTZX\$ = OUTZX\$ + "0"
IF KXZZ# - 4 >= 0 THEN KXZZ# = KXZZ# - 4: OUTZX\$ = OUTZX\$ + "1" ELSE OUTZX\$ = OUTZX\$ + "0"
IF KXZZ# - 2 >= 0 THEN KXZZ# = KXZZ# - 2: OUTZX\$ = OUTZX\$ + "1" ELSE OUTZX\$ = OUTZX\$ + "0"
IF KXZZ# - 1 >= 0 THEN KXZZ# = KXZZ# - 1: OUTZX\$ = OUTZX\$ + "1" ELSE OUTZX\$ = OUTZX\$ + "0"
NEXT K
END SUB

SUB Bynary.To.Ascii (X\$, KXX\$)
KXX\$ = ""
FOR X = 1 TO LEN(X\$) STEP 8
IF MID\$(X\$, X, 1) = "1" THEN XXZ = XXZ + 128
IF MID\$(X\$, X + 1, 1) = "1" THEN XXZ = XXZ + 64
IF MID\$(X\$, X + 2, 1) = "1" THEN XXZ = XXZ + 32
IF MID\$(X\$, X + 3, 1) = "1" THEN XXZ = XXZ + 16
IF MID\$(X\$, X + 4, 1) = "1" THEN XXZ = XXZ + 8
IF MID\$(X\$, X + 5, 1) = "1" THEN XXZ = XXZ + 4
IF MID\$(X\$, X + 6, 1) = "1" THEN XXZ = XXZ + 2
IF MID\$(X\$, X + 7, 1) = "1" THEN XXZ = XXZ + 1
KXX\$ = KXX\$ + CHR\$(XXZ): XXZ = 0
NEXT X
END SUB```

TOP4 DECOMPRESS (this can be merged or made a function/sub of the above to consolidate it as one program if desired)

Code:
```'TOP4 DECOMPRESS BASIC/Quickbasic/QB64 implementation by James Wasil 2018
'TOP4 is a compression method I came up with that can replace Huffman for some post-LZ and other final pass compression methods. It can be used that way or standalone.
'It's easy to use, easy to implement, easy to convert to other programming languages, straightforward, and treeless.
'This version will work with the 32 or 64 bit version of the QB64 compiler for Windows, Mac OS X, and Linux
'freely available at http://www.qb64.net or http://www.qb64.org
'It can work with classic Quickbasic or Qbasic from DOSBOX.

'Top4 can be made to work with Visual Basic 6.0 or higher by adding a form and placing this code in the main. The locate/write visual updates may be replaced with
'Caption changes to a .Text of a TextBox if you like. The rest should work as-is.

'This is released as public domain to use, improve upon, or do as you like with it for data compression.
'If you find this helpful or useful, please give credit or leave my name in the code. That's all I ask. :)
'Feel free to send feedback to: james.wasil@gmail.com
'Enjoy!

DIM P1\$(256), P2(256), P3\$(256): P = P - 1
FOR T = 0 TO 3: P = P + 1: TEMP\$ = CHR\$(T): CALL Ascii.To.Bynary(TEMP\$, OUT1\$): P3\$(P) = RIGHT\$(OUT1\$, 7): OUT1\$ = "": NEXT T: 'ADD FIRST 4 PATTERNS THAT ARE 7 BITS TO P3\$(P), WHERE P IS ALWAYS +1.
FOR T = 8 TO 249: P = P + 1: TEMP\$ = CHR\$(T): CALL Ascii.To.Bynary(TEMP\$, OUT1\$): P3\$(P) = OUT1\$: OUT1\$ = "": NEXT T: 'ADD NEXT 248 PATTERNS THAT ARE 8 BITS TO P3\$(P), WHERE P IS ALWAYS +1.
FOR T = 250 TO 254: P = P + 1: TEMP\$ = CHR\$(T): CALL Ascii.To.Bynary(TEMP\$, OUT1\$): P3\$(P) = OUT1\$ + "0": P = P + 1: P3\$(P) = OUT1\$ + "1": OUT1\$ = "": NEXT T: 'ADD LAST 8 PATTERNS THAT ARE 9 BITS TO P3\$(P), WHERE P IS ALWAYS +1.

FOR T = 0 TO 255: P1\$(T) = CHR\$(T): NEXT T: 'CREATE STANDARD ASCII TABLE

LINE INPUT "File to read:", FILE1\$
LINE INPUT "File to write to:", FILE2\$
OPEN FILE1\$ FOR BINARY AS #1
OPEN FILE2\$ FOR OUTPUT AS #2: CLOSE #2: OPEN FILE2\$ FOR BINARY AS #2
Z\$ = STRING\$(256, "0"): GET #1, , Z\$: HEADER\$ = Z\$: Z\$ = "P"
FOR T = 1 TO 256: P1\$(T - 1) = MID\$(HEADER\$, T, 1): NEXT T: 'Load HEADER\$ to P1\$() array.
'GET FILESIZE NEXT
Z\$="G"
DO
GET #1,,Z\$:IF Z\$="E" THEN EXIT DO ELSE FILESIZE\$=FILESIZE\$+Z\$
LOOP
SIZEFILE#=VAL(FILESIZE\$)

DO

'ENSURE WE HAVE ENOUGH BYTES
DO
Z\$=STRING\$(1024,"0"):IF LOF(1)-LOC(1)<32767 THEN Z\$="0"
IF LEN(OUTP\$)<16 AND LOF(1)-LOC(1)>0 THEN GET #1,,Z\$
'LOCATE 1, 1: WRITE LOC(1)
IF LEN(OUTP\$) < 16 THEN CALL Ascii.To.Bynary(Z\$, OUTP1\$): OUTP\$ = OUTP\$ + OUTP1\$: OUTP1\$ = ""
LOOP UNTIL LEN(OUTP\$)=>16 OR LOF(1)-LOC(1)=<0

LOCATE 1,1:WRITE LOF(1)-LOC(1)
IF LEFT\$(OUTP\$,9)="111111110" THEN OUTP\$ = RIGHT\$(OUTP\$, LEN(OUTP\$) - 9): OUTP2\$ = CHR\$(0)+CHR\$(0): PUT #2, , OUTP2\$
IF LEFT\$(OUTP\$,9)="111111111" THEN OUTP\$ = RIGHT\$(OUTP\$, LEN(OUTP\$) - 9): OUTP2\$ = LAST\$+LAST\$: PUT #2, , OUTP2\$
FOR PP = 0 TO 255: IF LEFT\$(OUTP\$, LEN(P3\$(PP))) = P3\$(PP) THEN OUTP\$ = RIGHT\$(OUTP\$, LEN(OUTP\$) - LEN(P3\$(PP))): OUTP2\$ = P1\$(PP): LAST\$=OUTP2\$:PUT #2, , OUTP2\$: EXIT FOR: 'ADD THE BINARY REFERENCE PATTERN BY ASCII VALUE, BASED ON THE STATISTICAL ARRANGEMENT OF THE SYMBOLS HERE.
NEXT PP
'WRITE OUTP\$
LOOP UNTIL LOF(2)=>SIZEFILE# AND LOF(1)-LOC(1)=<0
WRITE "Compressed File:", LOF(1)
WRITE "Original File:", LOF(2)
WRITE "Difference:", LOF(1) - LOF(2)
CLOSE #1
CLOSE #2
END

SUB Ascii.To.Bynary (X\$, OUTZX\$)
FOR K = 1 TO LEN(X\$)
KXZZ# = ASC(MID\$(X\$, K, 1))
IF KXZZ# - 128 >= 0 THEN KXZZ# = KXZZ# - 128: OUTZX\$ = OUTZX\$ + "1" ELSE OUTZX\$ = OUTZX\$ + "0"
IF KXZZ# - 64 >= 0 THEN KXZZ# = KXZZ# - 64: OUTZX\$ = OUTZX\$ + "1" ELSE OUTZX\$ = OUTZX\$ + "0"
IF KXZZ# - 32 >= 0 THEN KXZZ# = KXZZ# - 32: OUTZX\$ = OUTZX\$ + "1" ELSE OUTZX\$ = OUTZX\$ + "0"
IF KXZZ# - 16 >= 0 THEN KXZZ# = KXZZ# - 16: OUTZX\$ = OUTZX\$ + "1" ELSE OUTZX\$ = OUTZX\$ + "0"
IF KXZZ# - 8 >= 0 THEN KXZZ# = KXZZ# - 8: OUTZX\$ = OUTZX\$ + "1" ELSE OUTZX\$ = OUTZX\$ + "0"
IF KXZZ# - 4 >= 0 THEN KXZZ# = KXZZ# - 4: OUTZX\$ = OUTZX\$ + "1" ELSE OUTZX\$ = OUTZX\$ + "0"
IF KXZZ# - 2 >= 0 THEN KXZZ# = KXZZ# - 2: OUTZX\$ = OUTZX\$ + "1" ELSE OUTZX\$ = OUTZX\$ + "0"
IF KXZZ# - 1 >= 0 THEN KXZZ# = KXZZ# - 1: OUTZX\$ = OUTZX\$ + "1" ELSE OUTZX\$ = OUTZX\$ + "0"
NEXT K
END SUB

SUB Bynary.To.Ascii (X\$, KXX\$)
KXX\$ = ""
FOR X = 1 TO LEN(X\$) STEP 8
IF MID\$(X\$, X, 1) = "1" THEN XXZ = XXZ + 128
IF MID\$(X\$, X + 1, 1) = "1" THEN XXZ = XXZ + 64
IF MID\$(X\$, X + 2, 1) = "1" THEN XXZ = XXZ + 32
IF MID\$(X\$, X + 3, 1) = "1" THEN XXZ = XXZ + 16
IF MID\$(X\$, X + 4, 1) = "1" THEN XXZ = XXZ + 8
IF MID\$(X\$, X + 5, 1) = "1" THEN XXZ = XXZ + 4
IF MID\$(X\$, X + 6, 1) = "1" THEN XXZ = XXZ + 2
IF MID\$(X\$, X + 7, 1) = "1" THEN XXZ = XXZ + 1
KXX\$ = KXX\$ + CHR\$(XXZ): XXZ = 0
NEXT X
END SUB```

15. ## Thanks:

Sportman (19th October 2018)

16. Originally Posted by JamesWasil
Here is the source code to one of the other original versions
QB64 compiled versions:

17. ## Thanks:

JamesWasil (19th October 2018)

Page 2 of 2 First 12

#### Posting Permissions

• You may not post new threads
• You may not post replies
• You may not post attachments
• You may not edit your posts
•