# Thread: Best compression algorithm for a sequence of incremental integers

1. ## Best compression algorithm for a sequence of incremental integers

I have a huge array of incremental integers and I need to compress them.

All numbers are unique and progressively increasing, but some numbers are cut off from the sample, so it's impossible to express it as a range 1 - 6000.

I've read one post at stackoverflow.com which says:
First, preprocess your list of values by taking the difference between each value and the previous one (for the first value, assume the previous one was zero). This should in your case give mostly a sequence of ones, which can be compressed much more easily by most compression algorithms.
And what if I had a separate text document that contained the missing numbers? I don't think this approach could lead to a 100% compression ratio.

I expect at least 99.95% space saving - or 100% if it's possible, which I don't think it is.

Thanks a lot.
CompressMaster

2. If I'm understanding what you're asking, you want to compress the integers in the text file, but it's not possible to represent them as an incremental loop from 0 to 6000 because there are numbers missing?

What I would probably do is get the missing numbers first and use a 1-bit flag to represent the end markers between the digits, while representing the digits themselves as 3.32 bits each. The total would be 4.32 bits per digit, but that way a special decompressor only needs to see the flags and read the numbers to know what to add and where, rather than storing all of the sequential numbers that aren't missing. If you already know which numbers aren't there and the range, you can make a really small footprint to model around it.

The 1-bit flag can be 0 if you're at the last digit of a number, or 1 if the digit is followed by another. That way the decoder can tell the difference between 20, 200, or 2000 in a series of digits like 2020020200020220. When these digits are merged together, there'd be no way to tell where one number starts and the next ends unless you used a 1-bit flag or had another method to differentiate them.
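As a rough illustration, the digit/flag separation could be sketched like this in C (the function names are made up, and the two streams are kept as plain characters rather than packed bits, so this shows only the separation logic, not the 3.32-bit coding):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Split numbers into a digit stream plus a parallel flag stream:
   flag '1' = more digits of this number follow, '0' = last digit.
   Returns the length of both streams. */
static int encode_digits(const int *nums, int n, char *digits, char *flags)
{
    int len = 0;
    for (int i = 0; i < n; i++) {
        char buf[16];
        int d = sprintf(buf, "%d", nums[i]);
        for (int j = 0; j < d; j++) {
            digits[len] = buf[j];
            flags[len] = (j < d - 1) ? '1' : '0'; /* '0' ends the number */
            len++;
        }
    }
    digits[len] = flags[len] = '\0';
    return len;
}

/* Rebuild the numbers from the two streams. */
static int decode_digits(const char *digits, const char *flags, int *nums)
{
    int n = 0, cur = 0;
    for (int i = 0; digits[i]; i++) {
        cur = cur * 10 + (digits[i] - '0');
        if (flags[i] == '0') {
            nums[n++] = cur;
            cur = 0;
        }
    }
    return n;
}
```

For 20, 200 and 2000 this produces the digit stream `202002000` with flags `101101110`, which is exactly what lets the decoder tell the three numbers apart.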

Alternatively, you can keep the 1-bit flags in a separate variable (a string variable, for example) and convert them to binary when done. That may be more advantageous, because you can still represent the missing numbers' digits at 3.32 bits per symbol, but you can then statistically rearrange them, or use BWT and then MTF (move-to-front) on the digits to assign the smallest binary codes to those occurring most frequently.

That would introduce more complexity than you may want, because you'd have to move the binary flag and digit together as a pair every time you did a sort (same with the MTF of the digit afterward; you'd have to move the binary flag to the front along with the digit each time to make sure they stay together).
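For reference, a bare-bones move-to-front pass over the digit alphabet might look like the following (the flag/digit pairing described above is left out, so this is only the MTF half):

```c
#include <assert.h>
#include <string.h>

/* Move-to-front over '0'..'9': each digit is replaced by its current
   position in the table and then moved to the front, so recently seen
   digits get small indices (cheap codes for a later entropy coder). */
static void mtf_encode(const char *in, int *out)
{
    char table[11] = "0123456789";
    for (int i = 0; in[i]; i++) {
        int pos = (int)(strchr(table, in[i]) - table);
        out[i] = pos;
        memmove(table + 1, table, (size_t)pos); /* shift prefix right */
        table[0] = in[i];                       /* symbol to the front */
    }
}
```

A run such as "1112" comes out as 1, 0, 0, 2: repeated digits collapse to zeros, which is what makes the later coding stage cheap.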

Initially, I'd opt for storing the 1-bit flags as string 1, using a dictionary compression method on the missing digits grouped together as string 2, and then, for the final stage, running an arithmetic encoder on the result to see what I got.

I haven't gone over integers.txt yet, but do you have a list of the integers which aren't there to attach instead? If you do, it'll be faster to analyze than writing a program to look for the ones that are missing.

3. I'd recommend writing the data as a series of 0/1 bits representing the presence of each value. Then the whole file would be a series of 751 0xDD bytes, and you can just run-length encode it and achieve the 99.95% compression ratio.
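For the every-4th-missing sample, a sketch of this presence-bitmap-plus-RLE idea (in this MSB-first packing the 750 full bytes are 0xDD and the final partial byte comes out as 0x80):

```c
#include <assert.h>
#include <stdint.h>

#define N 6001               /* values 0..6000 */
#define NBYTES ((N + 7) / 8) /* 751 bytes */

/* Presence bitmap for 0..6000 where every i with i%4==2 is missing:
   per 4 values the bits are 1,1,0,1, so each full packed byte
   (MSB first) is 11011101 = 0xDD. */
static void build_bitmap(uint8_t *bm)
{
    for (int i = 0; i < NBYTES; i++) bm[i] = 0;
    for (int i = 0; i < N; i++)
        if (i % 4 != 2)
            bm[i / 8] |= (uint8_t)(0x80 >> (i % 8));
}

/* Trivial RLE: emit (run length, byte) pairs; returns number of pairs. */
static int rle(const uint8_t *bm, int n, int *runs, uint8_t *vals)
{
    int k = 0;
    for (int i = 0; i < n; ) {
        int j = i;
        while (j < n && bm[j] == bm[i]) j++;
        runs[k] = j - i;
        vals[k] = bm[i];
        k++;
        i = j;
    }
    return k;
}
```

Two (run, byte) pairs describe the entire 751-byte bitmap, which is where the ~99.95% saving comes from.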

4. Is the attached sample a true representation of what is to be compressed?
Every 4th integer is missing (2, 6, 10, 14, ...). You don't need to compress anything - it can be generated:
`for (int i = 0; i <= 6000; ++i) if (i % 4 != 2) printf("%d\n", i);`

5. @JamesWasil

Unfortunately, I don't have a list of the missing numbers.

You mentioned that my data should be written in the form of 0/1 bits. But my current sample does not contain a simple pattern that can be easily followed and thus reconstructed. So, would it still be possible to compress my sample down by 99.95% regardless?
Could I ask you to do that? Because I don't know how that data can be compressed down.

@Gotty

Of course, that was an oversimplified sample. The correct sample is attached below (see much_more_integers.txt).
I noticed earlier that there was a pattern that could be easily followed and thus reversed, as you described. But, unfortunately, my new sample is very different from the first.

6. Originally Posted by CompressMaster
@JamesWasil

You mentioned that my data should be written in the form of 0/1 bits. But my current sample does not contain a simple pattern that can be easily followed and thus reconstructed. So, would it still be possible to compress my sample down by 99.95% regardless?
Could I ask you to do that? Because I don't know how that data can be compressed down.
Your data does not need to be written in any particular format - you can do whatever you want with it. Usually the simplest representation compresses best, so I'd say that by writing the sequence as a series of text integers rather than just a 0 or 1 for each possible value, you are making it much harder for the compressor to recognize the data's pattern. In general you will not be able to reduce the file by 99.95%; that is only possible when there is a simple pattern, as in your first sample. For random data, it would take a little more than 1 bit per possible integer value to represent the data.

7. @Compressmaster:
Sooo, what are these integers? What do they represent?

8. Without knowing more, other than that some numbers are missing but most are not:

1. Turn it around - record missing ones and drop the observed ones.
2. Replace each value N by its delta from the previous value.
3. Compress the deltas using a basic entropy encoder, maybe with zig-zag encoding for variable-sized integers first if the scale is too large.

It's possible the missing values aren't random though - e.g. clustered in certain regions - in which case more advanced modelling will help.

Edit: actually, with a good enough entropy encoder, you can ditch the inversion in step 1 too. All those "1" deltas will compress away just as well.

I get your example file to 2729 bytes.
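Steps 2 and 3 could be sketched like this, with a LEB128-style varint standing in for a real entropy coder (the function names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Delta + varint: store the difference to the previous value (the first
   delta is relative to 0), 7 bits per byte with a continuation flag in
   the top bit. Deltas of an ascending unique sequence are >= 1 and
   usually tiny, so most take a single byte. */
static int encode_deltas(const uint32_t *vals, int n, uint8_t *out)
{
    int len = 0;
    uint32_t prev = 0;
    for (int i = 0; i < n; i++) {
        uint32_t d = vals[i] - prev;
        prev = vals[i];
        while (d >= 0x80) {              /* continuation bytes */
            out[len++] = (uint8_t)(d | 0x80);
            d >>= 7;
        }
        out[len++] = (uint8_t)d;         /* final byte, top bit clear */
    }
    return len;
}

static int decode_deltas(const uint8_t *in, int len, uint32_t *vals)
{
    int n = 0, shift = 0;
    uint32_t d = 0, prev = 0;
    for (int i = 0; i < len; i++) {
        d |= (uint32_t)(in[i] & 0x7F) << shift;
        if (in[i] & 0x80) { shift += 7; continue; }
        prev += d;                       /* undo the delta step */
        vals[n++] = prev;
        d = 0;
        shift = 0;
    }
    return n;
}
```

Since the values are unique and ascending, every delta is at least 1 and almost all of them fit in one byte before any entropy coding is applied at all.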

9. There are 8130 numbers present out of 23001. Assuming a random distribution with the computed probability that a number is present, this needs 21555 bits (2695 bytes) according to Shannon. Add four bytes for the counts and you could get this test file down to 2699 bytes. Assuming a non-random distribution requires one to model the data, and since the result in this case is more than 2700 bytes (tested using paq8px), I am pretty sure the test file is either randomly generated or the result of already-compressed data.

10. ## Thanks:

Gotty (12th October 2018)

11. 2709 with cdm.
I kept intermediate files (bitmaps etc), for reference.
http://nishi.dreamhosters.com/u/integers_0.rar

12. ## Thanks (2):

JamesWasil (12th October 2018),xinix (12th October 2018)

13. I created GearEnc for this type of data; it's not the best, but it can be better than RAR and 7-Zip:

It's best to use the command line with Delta enabled.

14. Originally Posted by CompressMaster
I have a huge array of incremental integers and I need to compress them.

All numbers are unique and progressively increasing, but some numbers are cut off from the sample, so it's impossible to express it as a range 1 - 6000.

I've read one post at stackoverflow.com which says:

And what if I had a separate text document that contained the missing numbers? I don't think this approach could lead to a 100% compression ratio.

I expect at least 99.95% space saving - or 100% if it's possible, which I don't think it is.

Thanks a lot.
CompressMaster
If you want performance, look into Daniel Lemire's integer compression libraries, like this one: https://github.com/lemire/streamvbyte

Depending on your use case, you might not need compression at all if you can generate the data algorithmically. That's what Gotty was getting at, and why people want to know what the data represents. As a simple example, if your data were the Fibonacci sequence, you wouldn't need compression - you'd just use code to generate it.
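As a toy example of "ship the generator, not the data", the whole Fibonacci "file" collapses to a few lines of C:

```c
#include <assert.h>
#include <stdint.h>

/* Regenerate the first n Fibonacci numbers instead of storing them. */
static int fibonacci(uint64_t *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = (i < 2) ? (uint64_t)i : out[i - 1] + out[i - 2];
    return n;
}
```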

15. As for the integer generation process: the numbers were first generated as an incremental loop from 0 to 25,000. Then I used an online service to get (randomness) - I'm using this term carefully, because in general it is of course not possible to compress random data.

Secondly, I used a sorting algorithm to sort all the numbers in ascending order.

Lastly, I got all the missing numbers using an online service (see much_more_integers_2.txt).

@JamesB

1. Turn it around - record missing ones and drop the observed ones.
see "much_more_integers_2.txt"

"I get your example file to 2729 bytes."

Thanks.

16. I am by no means an expert in compression, but I'm not sure why the delta preprocessing suggested on Stack Overflow can't be used at the integer level to begin with.

- All I see is a list of ever-increasing integer samples, with only small jumps and no falls - perfect for delta preprocessing, in my opinion.
- Then there is some human formatting I don't know if you want to retain (newlines etc.), but let's assume you do for this simple test.

Code:
```4	4
6	2
7	1
8	1
9	1
10	1
13	3
15	2
17	2
18	1
22	4
24	2
28	4
29	1
30	1
31	1
34	3
35	2
37	2
38	1
40	2
41	1
42	1
45	3
46	1
47	1
50	3
51	1
53	2
54	1
55	1
56	1
58	2
59	1
60	1
62	2
63	1
64	1
65	1
66	1
67	1
69	2```
As you can see, all your integer samples are now condensed to a single byte each.
Heck, if the increments stay in the 1-4 range, you only need 2 bits to encode each sample, and you don't need a sample separator either (you can remove the newline bytes as well, since you know every 2 bits is one sample).
This should get you beyond a 75% reduction compared to the raw text file.
Then look into some simple Huffman binary-tree compression if you need more, and you should be home free with a simple/fast/pretty efficient compressor (thanks to the nature of the data).

P.S.
You cannot compress something 100%.
Something cannot come from nothing.

P.P.S.
The reason you can compress this very easily even though you started with "random" input is that once you sorted it, you threw away data and removed some of the randomness (the order of the samples).
You are no longer trying to recreate the random data.

P.P.P.S
If you need fast seeking into this data, you can make delta-compressed chunks of, say, 20-50 integers at a time,
so that if you need integer number 207 on the list, you go to the start of chunk 4 and then read and add the delta values until you hit the 7th sample in that block.

P.P.P.P.S
Your integer sample would be down to only 3717.75 bytes from just the delta filtering and bit reduction.
That's a 96% compression ratio, and you haven't even applied any real compression yet.
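The 2-bits-per-sample idea above could be sketched like this (assuming every delta stays in the 1-4 range; the function names are made up):

```c
#include <assert.h>
#include <stdint.h>

/* A delta in 1..4 fits in 2 bits (store delta-1): four samples per
   byte, no separators needed. Returns the number of packed bytes. */
static int pack2(const int *deltas, int n, uint8_t *out)
{
    int nbytes = (n + 3) / 4;
    for (int i = 0; i < nbytes; i++) out[i] = 0;
    for (int i = 0; i < n; i++)
        out[i / 4] |= (uint8_t)((deltas[i] - 1) & 3) << (2 * (i % 4));
    return nbytes;
}

static void unpack2(const uint8_t *in, int n, int *deltas)
{
    for (int i = 0; i < n; i++)
        deltas[i] = ((in[i / 4] >> (2 * (i % 4))) & 3) + 1;
}
```

Anything that could exceed a delta of 4 would need an escape code on top, which is where the Huffman stage mentioned above would come in.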

17. @CompressMaster

(Emphasis by me.)

Originally Posted by CompressMaster
3.1 Both questions are correct. I started with random hexadecimal characters generated by myself written in one text file
Originally Posted by CompressMaster
This number has been generated randomly by myself.
Originally Posted by CompressMaster
By typing RANDOMLY numbers, of course.
Originally Posted by CompressMaster
Description: Each text file consists of the same character (in this case it's "A", but it can be replaced with any other symbol from the ASCII table) that has been generated RANDOMLY by myself, but in this case the randomness is irrelevant, because it contains only one letter.
Originally Posted by CompressMaster
Then I used an online service to get (randomness) - I'm using this term carefully, because in general it is of course not possible to compress random data.

Originally Posted by Gotty
Did you really generate random content? Random content is incompressible, you probably know that.
Originally Posted by Bulat Ziganshin
I bet that all these approaches are going around random data compression
Originally Posted by Gotty
If you generate something random, please don't ask us how to compress it.
Originally Posted by cfeck
I am pretty sure the test file is either randomly generated, or is the result of already compressed data
Can you draw a conclusion?

18. Hey all,

How about compressing an array of 256 Very Large Ints (BigNums or infinite-precision integers), i.e. a frequency table for 8-bit symbols?

This is hard to compress, I think, since the said frequency table is already the output of a compression algorithm (which was fascinating at first).

19. Anyway, the compression algorithm is best understood if you "visualize" a bar chart or histogram, where the new symbol's frequency must always become greater than the current highest frequency: we increment the current maximum by its delta from the new symbol's frequency, and that becomes the new symbol's frequency. Put simply, the new symbol must always end up with the highest frequency, so at most the highest frequency can "double", i.e. grow by 1 bit in length. (In decoding, the symbol with the highest frequency is the symbol to decode; this means it is stack-based. We add the delta to the highest frequency during encoding so that we can get back to the symbol's previous frequency when decoding.) The output is actually the frequency table itself, which is easy to compress or generate?

Algorithm pseudo-code:

Code:
```/* initialize the frequency table */
for (i = 0; i < 256; i++) freq[i] = i + 1;
max = freq[255];

/* mirror each symbol's frequency over the current maximum,
   so the new symbol always becomes the new maximum */
while ((c = get_byte(infile)) != EOF) {
    freq[c] = max + (max - freq[c]);
    max = freq[c];
}```

No "runs" of a single character allowed in the input, as much as possible. "Random data" indeed.

New or recalled observations:

1. This algorithm ironically "expands" the frequencies at first. LOL

We're back to the early days of information theory or data compression history!

2. The bombshell: it takes more than a 1-bit increase to encode very small frequencies that suddenly must become the maximum. The solution might be to "swap" them, but this requires new information or codes. That is back to delta coding, alas.

3. But a total cycling of the frequency table might work...

4. Instead of 8-bit bytes, use 4-bit symbols;

***

This is similar, i think, to WEB Technologies' algorithm as featured in BYTE magazine in 1992 and noted by comp.compression FAQ:
"WEB, in fact, says that virtually any amount of data can be
squeezed to under 1024 bytes by using DataFiles/16 to compress
its own output multiple times."

I think they were using or playing with a frequency table too, 256 32-bit frequencies = 1K.

They might have had to output the MSbit of the highest frequency, the result of which may equal another byte's frequency (or frequencies)?

That's why they had the problem that 4 numbers in a matrix are equal, a rare case in their algorithm.

Just maybe.

(Ideally, there is at most a 1-bit increase in the frequency of the output or new symbol, but the bombshell precludes that. If they are of the same bit size, then there is only a 1-bit increase in the new max frequency.

The current symbol has always the highest frequency.

You decode backwards, from the last symbol to the first; the symbol with the highest frequency is the current symbol.

One parameter in decoding is the famed file_size().
)

The problem with the algorithm is that the emitted frequency table can be very large, due to the very large frequencies, if you implement it by really using BigNums or BigInts;
you then have to compress that very large frequency table.

Maybe to achieve compression you can just consider the MSbit after the arithmetic (addition) operation.
Or the solution is nearly just MTF (you have to output the character that *doubled*, i.e. whose MSbit was activated).

WEB Technologies' DataFiles/16 algorithm is clearly designed for the compression of *random* data, and recursive compression, both of which are futile indeed.

20. > 4. Instead of 8-bit bytes, use 4-bit symbols;

How about 2-bit (base-4) symbols? Or maybe even better, a data source of base-3 symbols?

21. Originally Posted by compgt
Ideally, at most 1 bit increase in frequency of output or new symbol, but the bombshell precludes that. If they are of the same bitsize, then only 1 bit increase in the new max frequency.
To solve the bombshell (the presence of "anomalous" symbols), (1) you must have a way to create a smooth frequency distribution, i.e. the frequencies must be of the same bit size as much as possible; there are many ways to do this, but they must be reversible. For example, you could XOR the bytes of the data source first (this is reversible), pre-whitening it for encoding. (2) Or the bombshell symbol or byte can be treated like an LZ77 literal: simply output a prefix bit flag (anomalous symbol or not?) for each symbol. This means at most two bits per symbol, with the bit flag indicating whether the symbol's sum doubles (MSbit), plus the anomalous symbol itself, 8 bits, when it happens. I wonder how large the frequencies or the freqtable would be...

And, as in Huffman coding, you can contrive or generate a file that is exactly perfect or suitable for this algorithm. What would be interesting is that the generated file would be "random-appearing data", perhaps indeed incompressible by known compressors.

(See the post above, which now has pseudo-code for easier understanding.)