Hey guys, so I have an array of unsorted unsigned integers with 1000 unique items between 0 and 1000. (Basically a shuffled array from 0-1000) I've tried many ways to compress it, so far the best I've got is 1524 bytes with lz4 compression. I've compared to lz4hc, zstd0-22, brotli0-11, rANS and fastpfor (compressed int array)

The point of this is to send an array of unique ids not greater than 1000 in preserved order across the network.

Can this be compressed better? I appreciate the help guys!

edit: One more thing I should add, the client on the other side already has the exact same values in an array just in a different order, could I possibly just inform them of the order that they are in? Possibly with a hash and have the client sort the array accordingly? I was also trying out cuckoo filters but they are lossy and I'd prefer lossless techniques.

Last edited by Warvstar; 15th December 2016 at 16:56.

0-1000 values can be encoded in 10 bits.
Sending 1000 value of 10 bits gives you 10000 bits, hence 1250 bytes.
If everything else is random, that's likely the simplest way to reach this performance.

0-1000 values can be encoded in 10 bits.
Sending 1000 value of 10 bits gives you 10000 bits, hence 1250 bytes.
If everything else is random, that's likely the simplest way to reach this performance.

Doh! I can't believe I didn't think of that, it's actually how I was packing some other data.

Originally Posted by Shelwien

@Cyan: stegdict actually encodes the permutation, so result should be around Log[256.,1000!] = 1067 bytes

Interesting! I'm going to look into this more in the morning.

One more thing I should add, the client on the other side already has the exact same values in an array just in a different order, could I possibly just inform them of the order that they are in? Possibly with a hash and have the client sort the array accordingly? I was also trying out cuckoo filters but they are lossy and I'd prefer lossless techniques.

This forum is awesome, thanks for the quick replies guys!

do it's always a permutation of a range, or it may have less elements than the range size?

It could have less than the full range, anywhere from 0-1000 values. Also I made an addition to the original post, the client shares the same list of possible items/values, they just needs to know in what order the server has the items.

The arrays are known to both the client and the server, so you can send the differences between the two arrays by using gamma ZigZag encoding.
In the ideal case (no differences) you will send 1000/8 = 125 bytes(1 bit per item). It is also possible to combine this with Run Length Encoding.
You can use a flag to switch between this encoding and full array transfer.

Oi haha, oops. I was thinking that might be the case. Your demo does appear to have better compression than what I was able to achieve; however, I'm hesitant to use stegdict (GPL) because I'm not sure the license I will release my project under yet, it may very well be closed source as I'm developing a PC game.

Impressive though, as it would shave 16kbps off my network data, topping my dataoutput for 1000 units to 114kbps.

Edit: A little bit extra details wont hurt I guess, the array are the id's, I send the update 10 times a second (100kbps packing the bits, 84kbps if I can get the permutations working). All the other unit data (deltas) I can fit into 33kbps.

1. Its my own project so license is not really an issue

2. You probably don't need the whole of it for your task, as stegdict purpose was a bit more complicated, especially v1 one.
Basically you just need a rangecoder (simpler one than stegdict's rc128 would do if you don't need to support permutations of millions of items),
then you can sort the list and encode the indexes with rc - with STL it probably can be done in like 5 lines of code.
(Encode the position of first sorted list item in unsorted list (0..N-1), delete it from unsorted list,
encode the position of 2nd item in unsorted list (0..N-2), delete it, etc)

3. Its possible to add extra compression, as stegdict considers all permutations to have equal probability.
The most obvious idea is to encode the indexes with a bitwise model, like in lzma etc.

4. Actually you can also try writing item indexes to a file as 16-bit binary numbers, then compress the file with lzma lp1 pb1

The arrays are known to both the client and the server, so you can send the differences between the two arrays by using gamma ZigZag encoding.
In the ideal case (no differences) you will send 1000/8 = 125 bytes(1 bit per item). It is also possible to combine this with Run Length Encoding.
You can use a flag to switch between this encoding and full array transfer.

That sounds like somethig I'vee tried before but I was not able to do it efficiently; however, now that I know what its called I might have a better chance, I'll look into this again. Thanks

Just to make it simple, consider that in order to encode your list, you first have an array of 1001 unique values 0..1000, and that for each position you permute it with one of the next entries that contain the value you want. You just have to remember which permutations you perform, for each position N you only have 1001-N possible permutations. The position you permute with requires less bits as the position progresses, so if you only encode the positions, ideally you'd encode 1001*1000*999*998...*2*1 possibilities = 1001, hence the log256(1001!) size max. For ease of implementation you could decide that you use 10 bits for the first half, then 9, then 8, then 7 etc.. If your values are often partially sorted, you may end up with some constant position offsets and it could make sense to use a variable-length encoding for the positions so that the most common ones take less space.

With a truncated binary encoding of the positions I got 1072...1077 bytes (less than 1% over entropy). Decoder is left as an exercise. Here, 'bc' is any bit encoder class that implements https://en.wikipedia.org/wiki/Truncated_binary_encoding

int array[COUNT]; // initialize with random permutation of values 1...1000 somewhere

int available = COUNT;
for (int val = MIN_VAL; val <= MAX_VAL; ++val) {
int i;
int marked = 0;
for (i = 0; i < COUNT; ++i) {
if (array[i] < val) ++marked;
if (array[i] == val) break;
}
if (available > 1) {
bc.writeTruncated(i - marked, available);
}
--available;
}

With a truncated binary encoding of the positions I got 1072...1077 bytes (less than 1% over entropy). Decoder is left as an exercise. Here, 'bc' is any bit encoder class that implements https://en.wikipedia.org/wiki/Truncated_binary_encoding

int array[COUNT]; // initialize with random permutation of values 1...1000 somewhere

int available = COUNT;
for (int val = MIN_VAL; val <= MAX_VAL; ++val) {
int i;
int marked = 0;
for (i = 0; i < COUNT; ++i) {
if (array[i] < val) ++marked;
if (array[i] == val) break;
}
if (available > 1) {
bc.writeTruncated(i - marked, available);
}
--available;
}

Nice thanks for this! I'm going to loom into this when I get back home tonight[.

The arrays are known to both the client and the server, so you can send the differences between the two arrays by using gamma ZigZag encoding.
In the ideal case (no differences) you will send 1000/8 = 125 bytes(1 bit per item). It is also possible to combine this with Run Length Encoding.
You can use a flag to switch between this encoding and full array transfer.

Can you tell me what I might be doing wrong/overlooking here. Using this code below my array compresses to 1600 bytes. Maybe this technique is not good for completely random orders?

//First create a sorted array with 0-1000 (the same array as on the client)
unsigned int value = 1000;
std::vector<uint16_t> unitdatasorted(value);
for (unsigned int i = 0; i < value; ++i) {
unitdatasorted[i] = i;
}

//Then create a shuffled array with 0-1000 (the one we want to send to client)
std::vector<uint16_t> unitdata(value);
unitdata = Shuffled(unitdatasorted); //This just returns a shuffled array
//Then find differences between the original / perform delta encoding
std::vector<int16_t> unitdatadelta(value);
for (unsigned int i = 0; i < value; ++i) {
unitdatadelta[i] = unitdata[i] - unitdatasorted[i];
}
//Then use zigzag encoding to turn signed to unsigned

std::vector<uint16_t> unitdataencoded(value);
for (unsigned int n = 0; n < value; ++n) {
unitdataencoded[n] = (unitdatadelta[n] << 1) ^ (unitdatadelta[n] >> 15);
}

With a truncated binary encoding of the positions I got 1072...1077 bytes (less than 1% over entropy). Decoder is left as an exercise. Here, 'bc' is any bit encoder class that implements https://en.wikipedia.org/wiki/Truncated_binary_encoding

int array[COUNT]; // initialize with random permutation of values 1...1000 somewhere

int available = COUNT;
for (int val = MIN_VAL; val <= MAX_VAL; ++val) {
int i;
int marked = 0;
for (i = 0; i < COUNT; ++i) {
if (array[i] < val) ++marked;
if (array[i] == val) break;
}
if (available > 1) {
bc.writeTruncated(i - marked, available);
}
--available;
}

Nice! I got this working! I'm not sure it's possible to get it much lower than this.

You must also encode the zigzag array using codecs suitable for small values like gamma coding or variable nibble coding.
You must compare the compressed length of this method and your main method and choose the best.
Use the first 1 bit flag to indicate the method used.
In the ideal case, when there is no difference between the two arrays, you'll encode only 1 bit per element.
This can be exploited only If your arrays are not random, otherwise you'll see no benefits (because of network latency) compressing and sending the arrays through a network.
Without compression we have already 10 bits * 1000 = 1250 bytes. saving only ~100 bytes is not worth the effort.