Hi all, I'm relatively new to data compression and have been toying around with enwik8.
After noticing that the page ids (which appear in well-structured <id></id> tags) are sequential with occasional gaps - presumably from deletions of dead pages - I tried my hand at delta encoding the gaps, knowing that I could assume a delta of at least 1 for each subsequent id.
Knowing that there are 12,347 total pages in enwik8, I did as follows:
The largest gap I found was 198, so the first naive implementation stored each delta in ceil(log2(198)) = 8 bits, which took up 12,347 bytes. I'm sure no one is surprised by that.
12,347 bytes was already much better than the 226,718 bytes of the raw text lines ("____<id>%i</id>\n").
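For reference, deriving the gaps is just one subtraction per id. Here's a minimal sketch (variable names are illustrative, the id parsing itself is assumed to have happened already, and I'm glossing over how the very first id is seeded):

#include <stddef.h>

/* Sketch: turn a sorted list of page ids into "surplus" gaps.
   Since consecutive ids always differ by at least 1, only the
   surplus beyond that guaranteed +1 needs to be stored. */
void compute_gaps(const int *ids, size_t n, int *gaps)
{
    for (size_t i = 1; i < n; i++)
        gaps[i - 1] = ids[i] - ids[i - 1] - 1;  /* 0 for strictly sequential ids */
}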
I wondered how I could make that more efficient, so I plotted the deltas to get a feel for their shape:

[histogram of the delta values]
As you can see, outliers like 198 and 178 massively inflate the bits required, so the next strategy was to bit-pack everything except those two into 5 bits each and repair those two manually during decoding. With the outliers set aside, the largest delta was only 25 (5 bits), which takes ceil((12,345 * 5) / 8) = 7,716 bytes (plus a few more for the two repairs), which I thought was a neat improvement.
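A minimal sketch of that fixed-width packing, assuming an MSB-first bit order and a pre-zeroed output buffer (the two-outlier exception list is handled elsewhere):

#include <stdint.h>
#include <stddef.h>

/* Sketch: pack n values of `width` bits each into `out`, MSB-first.
   `out` must be zero-initialized; values must fit in `width` bits
   (the two outliers are assumed clamped out and repaired separately). */
void pack_bits(const uint32_t *vals, size_t n, unsigned width, uint8_t *out)
{
    size_t bitpos = 0;
    for (size_t i = 0; i < n; i++) {
        for (int b = (int)width - 1; b >= 0; b--) {  /* high bit first */
            if ((vals[i] >> b) & 1u)
                out[bitpos >> 3] |= (uint8_t)(0x80u >> (bitpos & 7));
            bitpos++;
        }
    }
}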
Next I thought: can I embed that idea directly in an unpacking structure that operates on interval steps?
I worked out the math and found that:
12,347 values at 1 bit each: 0 for no additional delta, 1 for additional delta (max is 1)
2,151 of those were 1's. Most of them go no higher, so I stored another 2,151 values at 1 bit each: 0 for done, 1 for additional delta (max is 2)
475 of those had reached the equivalent of 2 by this point, so I stored another 475 as 2-bit values: 0 for done, 1-3 as higher amounts, with 3 meaning there may be additional delta beyond the sum so far (max is 5)
18 of those reached 5, so I stored 18 as 4-bit values using the same idea, with the max reaching 20
5 of those reached 20, so I stored those 5 as 8-bit values.
Altogether, this required 1,544 + 269 + 119 + 9 + 5 = 1,946 bytes.
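To make that concrete: a surplus of 0 (ids exactly 1 apart, by far the most common case) costs a single layer-1 bit. A surplus of 7 walks through four layers: layer 1 emits 1 (running sum 1), layer 2 emits 1 (sum 2), layer 3 emits its escape code 3 (sum 5), and layer 4 emits 2 (sum 7, done) - 1 + 1 + 2 + 4 = 8 bits in all.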
I wrote out a binary file containing those bit-packed sequences, then wrote a short C program to decode it again, and it worked fine.
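Condensed, the decoding loop looks something like the sketch below (the bit-reader and exact stream layout are simplified here; each layer is stored as its own contiguous bit stream, and a layer's maximum code means "add the max and continue in the next layer"):

#include <stdint.h>
#include <stddef.h>

typedef struct { const uint8_t *buf; size_t pos; } BitReader;

/* Read `width` bits, MSB-first, from a layer's bit stream. */
static uint32_t read_bits(BitReader *r, unsigned width)
{
    uint32_t v = 0;
    while (width--) {
        v = (v << 1) | ((r->buf[r->pos >> 3] >> (7 - (r->pos & 7))) & 1u);
        r->pos++;
    }
    return v;
}

static const unsigned WIDTH[5] = { 1, 1, 2, 4, 8 };  /* bits per layer */

/* Decode one surplus-delta value: accumulate layer codes until a
   non-maximum code terminates the walk (layer 5 always terminates). */
uint32_t decode_extra(BitReader layer[5])
{
    uint32_t sum = 0;
    for (int i = 0; i < 5; i++) {
        uint32_t max  = (1u << WIDTH[i]) - 1;  /* the escape code */
        uint32_t code = read_bits(&layer[i], WIDTH[i]);
        sum += code;
        if (code != max || i == 4)
            break;
    }
    return sum;
}

Each decoded value then just gets added (plus the implicit +1) to a running id counter before the "____<id>%i</id>\n" line is printed.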
I then ported it to assembly (and I am by no means a talented assembly programmer) and ended up with a 4,096-byte win32 console executable (data embedded), which I thought was pretty great for a little decoding engine that can unpack the deltas and recreate the original enwik8 strings.
For comparison, here is the 4,096-byte executable as a percentage of each alternative representation:
Raw enwik8 lines: 226,718 (1.8%)
Parsed chars/no-tags (just the numbers): 53,320 (7.7%)
32-bit integers: 49,388 (8.3%)
8-bit deltas: 12,347 (33.2%)
Control: cmix achieved a 1,685-byte file on a text file containing just the original page ids.
So I'm pretty sure I've reinvented a wheel here, but I have no idea what the technique is called, which makes it hard to learn more about it. It's like a delta encoder, but one that infers additional interval steps from an expansion step and the ceiling of the previous interval.
Any ideas?
I've attached the raw data and the executable for review, and included a link to the bit-packing data as a Google Sheet.
enwik8_raw_page_ids.txt - a grep of all page id lines from enwik8
page_id_decoder.exe - extracts the same content as enwik8_raw_page_ids.txt
https://docs.google.com/spreadsheets/d/1Xq9-KVF40BxwUNf6pkXM0z2L6X0vKryT-fw8122VGZE/edit?usp=sharing