
Thread: What encoding technique is this called?

  1. #1
    Member
    Join Date
    Dec 2019
    Location
    Washington, D.C.
    Posts
    4
    Thanks
    0
    Thanked 2 Times in 2 Posts

    What encoding technique is this called?

    Hi all, I'm relatively new to data compression and have been toying around with enwik8.

    After noticing that the page ids (which are strongly structured as <id></id> tags) are sequential, with gaps presumably due to deletions of dead pages, I tried my hand at delta encoding the gaps, knowing that in general I could assume a delta of at least 1 for each subsequent id.

    Knowing that there are 12,347 total pages in enwik8, I did as follows:

    The largest gap I found was 198, so in the first naive implementation I stored each delta in ceil(log2(198)) = 8 bits, which took up 12,347 bytes. I'm sure no one is surprised by that.

    12,347 bytes was much better than the 226,718 bytes that the natural text represented ("____<id>%i</id>\n").
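
    For the curious, that first pass boils down to something like this (a simplified C sketch, not my actual code; it assumes the ids are already parsed out of the <id> tags):

    Code:
    #include <stdio.h>
    #include <stddef.h>

    /* Writes one byte per page id: the gap to the previous id minus the
     * implied minimum step of 1 (every gap fits, since the largest is 198). */
    static void encode_deltas(const unsigned *ids, size_t n, FILE *out)
    {
        unsigned prev = 0;
        for (size_t i = 0; i < n; i++) {
            fputc((int)(ids[i] - prev - 1), out);
            prev = ids[i];
        }
    }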

    I wondered how I could make that more efficient, so I tried getting a feel for the shape of the deltas, which looked like this:

    [Attached image: image.png, a plot of the delta values]

    As you can see, the outliers like 198 and 178 massively inflated the bits required, so the next strategy was to bit-pack everything but those two at only 5 bits each and fix those two manually during decoding. The largest remaining delta was only 25 (5 bits), and that took ceil((12,345 * 5) / 8) = 7,716 bytes (plus a few more for the two repairs), which I thought was a neat improvement.
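
    The packing itself is just shifting bits into a byte buffer; roughly like this (simplified sketch: the caller supplies a zeroed buffer and the outlier positions are recorded separately):

    Code:
    #include <stdint.h>
    #include <stddef.h>

    /* Packs each delta that fits in 5 bits into out[] LSB-first; the two
     * outliers are skipped here and patched back in by position during
     * decoding. out must be zeroed and at least ceil(5*n/8) bytes long.
     * Returns the number of bytes used. */
    static size_t pack_5bit(const unsigned *deltas, size_t n, uint8_t *out)
    {
        size_t bit = 0;
        for (size_t i = 0; i < n; i++) {
            if (deltas[i] >= 32)
                continue;                    /* outlier, handled separately */
            for (int b = 0; b < 5; b++, bit++)
                if (deltas[i] & (1u << b))
                    out[bit / 8] |= (uint8_t)(1u << (bit % 8));
        }
        return (bit + 7) / 8;
    }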

    Next I thought: can I embed that idea directly in an unpacking structure that operates on interval steps?

    I worked out the math and found that:


    • 12,347 values at 1 bit each: 0 for no additional delta, 1 for an additional delta (max so far: 1)
    • 2,151 of those had a 1. Most of them go no higher, so I stored another 2,151 values at 1 bit each: 0 for done, 1 for an additional delta (max so far: 2)
    • 475 of those had reached the equivalent of 2 by this point, so I stored another 475 values at 2 bits each: 0 for done, 1-3 added on top, with 3 meaning there may be further delta beyond the sum so far (max so far: 5)
    • 18 of those reached 5, so I stored 18 values at 4 bits each using the same idea, with the max reaching 20
    • 5 of those reached 20, so I stored 5 values at 8 bits each.


    Altogether, this required 1,544 + 269 + 119 + 9 + 5 = 1,946 bytes.

    I wrote out a binary file containing those bit-packed sequences, then wrote a short C program to decode it, and it worked fine.
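
    For reference, the per-value decode logic is roughly this (a simplified sketch of the idea rather than the actual program; each level has its own packed stream, and the all-ones code at a level means "add it and keep reading at the next, wider level"):

    Code:
    #include <stdint.h>
    #include <stddef.h>

    /* Bit widths for the five levels described above. */
    static const int WIDTH[5] = { 1, 1, 2, 4, 8 };

    struct bitstream {
        const uint8_t *data;   /* one packed stream per level, LSB-first */
        size_t pos;            /* next bit index */
    };

    static unsigned read_bits(struct bitstream *s, int width)
    {
        unsigned v = 0;
        for (int b = 0; b < width; b++, s->pos++)
            v |= (unsigned)((s->data[s->pos / 8] >> (s->pos % 8)) & 1u) << b;
        return v;
    }

    /* Decodes one "extra delta": the all-ones code at a level means
     * "add it and continue with the next, wider level". */
    static unsigned decode_extra_delta(struct bitstream levels[5])
    {
        unsigned total = 0;
        for (int level = 0; level < 5; level++) {
            unsigned escape = (1u << WIDTH[level]) - 1;
            unsigned v = read_bits(&levels[level], WIDTH[level]);
            total += v;
            if (v < escape)
                break;
        }
        return total;   /* actual page-id gap = 1 (implied) + total */
    }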

    I ported it to assembly (and I am by no means a talented assembly programmer) and ended up with a 4,096-byte Win32 console executable (data embedded), which I thought was pretty great for a little decoding engine that can unpack the deltas and recreate the original enwik8 strings.

    The total storage, compared by representation (each percentage is 4,096 bytes relative to that size):

    • Raw enwik8 lines: 226,718 (1.8%)
    • Parsed chars/no-tags (just the numbers): 53,320 (7.7%)
    • 32-bit integers: 49,388 (8.3%)
    • 8-bit deltas: 12,347 (33.2%)


    Control: cmix achieved a 1,685-byte file on a text file containing just the original page ids.

    So I'm pretty sure I've reinvented a wheel here, but I have no idea what the technique is called, so I don't know where to go to learn more about it. It's like a delta encoder, but one that infers additional interval steps through a staged expansion, using the known ceiling of the previous interval.

    Any ideas?

    I've attached the raw data and the executable for review, and included a link to the bit packing data as a Google sheet.

    [Attachments]
    • enwik8_raw_page_ids.txt - a grep of all page id lines from enwik8
    • page_id_decoder.exe - extracts the same content as enwik8_raw_page_ids.txt



  2. Thanks:

    JamesWasil (7th December 2019)

  3. #2
    Member JamesWasil's Avatar
    Join Date
    Dec 2017
    Location
    Arizona
    Posts
    65
    Thanks
    70
    Thanked 13 Times in 12 Posts
    Hi Cottenio, and welcome to the Encode Data Compression forums!

    What you've done is interesting: it's a form of lossless delta compression combined with intelligent binary headers and flags that predict distances for the page data, compressing the gaps with the shortest predictions possible and the logarithmic number of bits required for each structure.

    There really should be a name for structural targeting and compression of file attributes that are outside of LZ pattern matching or other forms of weighted context mixing and partial matches, etc. As far as I'm aware, there is no official name for it yet, but perhaps there should be one and a glossary of nomenclature to accompany it?

    There are, of course, mathematically defined concepts with commonly understood and accepted names: things like order 0 meaning 1-byte analysis, order 1 meaning 2 bytes, order 2 meaning 3, and so on (always n+1).

    There are many things not defined by a static name that it may be beneficial to assign one to, and your neat work (whether you partially reinvented the wheel or not) brings the importance of that to the forefront.

    I suppose we should name it the Cottenio Delta algorithm, since it is yours and it is a form of delta encoding. What do you guys think?

    P.S.: You may want to see if there are ways to apply this and tailor it to other text files outside of enwik8...perhaps focusing on spaces and CR+LF page breaks (chr(13)+chr(10)) or other commonalities in text, to preformat it for compression. There are several ways you might go about implementing this, like detecting how many words exist before a space or a sequence of common characters, removing them, and then representing them with binary flags, similar to how you did with the page IDs and missing pages from enwik8.

    That said, it might end up being a form of precomp if you're able to isolate the bit flags and have the rest of the file remain text data that can still be worked upon by other compressors, adding to their efficiency. That's a way to approach it and one idea for it, but there are many more I'm sure.


    P.P.S.: If you do decide to expand upon it further or tie this implementation together with a form of LZ hybrid for a stand-alone text-based compressor, you might find some of Ross Williams' work from years ago beneficial; it's freely available at http://ross.net/compression/introduction.html

    (You still might want to make it a precomp and use cmix for better compression, but you have plenty of options)
    Last edited by JamesWasil; 7th December 2019 at 09:00. Reason: Added ross.net suggestion

  4. #3
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Hungary
    Posts
    399
    Thanks
    278
    Thanked 283 Times in 149 Posts
    Also refer to Fastest and smallest enwik8 skeleton (de)compression (for fun)
    The observation is the same, the solution is a little bit different.

  5. Thanks:

    JamesWasil (7th December 2019)

  6. #4
    Member
    Join Date
    Dec 2019
    Location
    Washington, D.C.
    Posts
    4
    Thanks
    0
    Thanked 2 Times in 2 Posts
    Quote Originally Posted by JamesWasil View Post
    Hi Cottenio, and welcome to the Encode Data Compression forums!

    What you've done is interesting: it's a form of lossless delta compression combined with intelligent binary headers and flags that predict distances for the page data, compressing the gaps with the shortest predictions possible and the logarithmic number of bits required for each structure.

    There really should be a name for structural targeting and compression of file attributes that are outside of LZ pattern matching or other forms of weighted context mixing and partial matches, etc. As far as I'm aware, there is no official name for it yet, but perhaps there should be one and a glossary of nomenclature to accompany it?

    ...

    There are many things not defined by a static name that it may be beneficial to assign one to, and your neat work (whether you partially reinvented the wheel or not) brings the importance of that to the forefront.

    I suppose we should name it the Cottenio Delta algorithm, since it is yours and it is a form of delta encoding. What do you guys think?

    P.S.: You may want to see if there are ways to apply this and tailor it to other text files outside of enwik8...perhaps focusing on spaces and CR+LF page breaks (chr(13)+chr(10)) or other commonalities in text, to preformat it for compression. There are several ways you might go about implementing this, like detecting how many words exist before a space or a sequence of common characters, removing them, and then representing them with binary flags, similar to how you did with the page IDs and missing pages from enwik8.
    Thanks for the warm welcome James! I really appreciate your insight and will definitely try out the technique on other sources as well; thanks for the links!

  7. Thanks:

    JamesWasil (7th December 2019)

  8. #5
    Member
    Join Date
    Dec 2019
    Location
    Washington, D.C.
    Posts
    4
    Thanks
    0
    Thanked 2 Times in 2 Posts
    Quote Originally Posted by Gotty View Post
    Also refer to Fastest and smallest enwik8 skeleton (de)compression (for fun)
    The observation is the same, the solution is a little bit different.
    Hi Gotty! Ha, you're absolutely right, and what's crazy is that I had the same thought and am building a minimal skeleton first. I have similar results to yours, although I also found some fun ways to exploit the relationship between the timestamps and revision_ids by storing the errors from predicted values.

    I'll definitely check out your code as well!

  9. #6
    Member
    Join Date
    Dec 2019
    Location
    Washington, D.C.
    Posts
    4
    Thanks
    0
    Thanked 2 Times in 2 Posts
    Good news! I applied the idea (recursive deltas) to identifying the placement of the gaps (since the data is otherwise sequential) instead of bitmapping the placements first, and it brought the total encoding size down to 1,421 bytes, which handily beats the 1,685-byte cmix benchmark.
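
    The change, roughly, was to record only the indices at which a gap occurs and delta-encode those indices, instead of spending a flag bit on every page id; a simplified sketch of that step (illustrative names, not my actual code):

    Code:
    #include <stddef.h>

    /* extra[i] is the extra delta (beyond the implied +1) for page index i.
     * Writes the index gaps between nonzero entries into pos_deltas[] and
     * returns how many there are; those then go through the same cascaded
     * bit-packing as before. */
    static size_t gap_position_deltas(const unsigned *extra, size_t n,
                                      unsigned *pos_deltas)
    {
        size_t count = 0, prev = 0;
        for (size_t i = 0; i < n; i++) {
            if (extra[i] != 0) {
                pos_deltas[count++] = (unsigned)(i - prev);
                prev = i;
            }
        }
        return count;
    }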

    Per James' suggestion I'm going to write up a quick tool to do compression/decompression automatically on "delta compatible" data like this.
