
Thread: Broken window compression

  1. #1
    Member SolidComp's Avatar
    Join Date
    Jun 2015
    Location
    USA
    Posts
    238
    Thanks
    94
    Thanked 47 Times in 31 Posts

    Broken window compression

    Hi all – When compressors use a sliding window, it's one monolithic window of a specified size (e.g. 32 KiB for DEFLATE), and that window slides from the start to the end of the file.

    Such approaches take no account of particular types of files or data, their size, or their structure. Would it make sense to break the window into several windows, probably of different sizes? For example, we could have a 64 KiB window for the first 128 KiB of a file, then maybe a 1 MiB window. We could have several windows (and static dictionaries), depending on the nature and size of the file, which we'll often know in advance.
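    Just to illustrate the shape of the idea, here's a rough sketch of a position-dependent window schedule (the function names, breakpoints, and sizes are arbitrary, purely for illustration):

    Code:
    # Rough sketch of a position-dependent window schedule.
    # Breakpoints and sizes are arbitrary examples, not tuned values.
    def window_size(offset: int) -> int:
        """Match-window size to use at a given byte offset in the file."""
        if offset < 128 * 1024:       # first 128 KiB of the file
            return 64 * 1024          # 64 KiB window
        return 1024 * 1024            # beyond that, a 1 MiB window

    def window_start(offset: int) -> int:
        """Earliest position a match may reference from the given offset."""
        return max(0, offset - window_size(offset))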

    One minor example is an HTML file. A lot of the text at the top of the file will never be seen again – e.g. we probably won't get matches on these common strings in the body of the HTML document, or even a chunky substring match:


    • <!DOCTYPE
    • <meta property="og:title" content=
    • <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">


    We could probably break the window after a good chunk of the head section, and start a new window for the body (or start with any CSS in the head, as in AMP, combined with the body, since selectors will yield matches). I don't expect a huge payoff for HTML files, but broken window compression could significantly improve compression in other contexts. Note that we could also concatenate part of the last window with the new window if that was beneficial.

    We could break static dictionaries into several smaller dictionaries depending on the nature and structure of the file. There's a lot more to say about that, but it will take careful research to discover if and when this approach pays dividends.

    Does this sound promising? Has it already been done?

  2. #2
    Programmer michael maniscalco's Avatar
    Join Date
    Apr 2007
    Location
    Boston, Massachusetts, USA
    Posts
    113
    Thanks
    11
    Thanked 88 Times in 26 Posts
    I've built commercial delta compression engines using this basic idea. Break your input into fixed-size blocks. Then run a rolling checksum over each block and keep only the checksums that match some specific characteristic (like checksum % N == 0, which keeps roughly one checksum per N bytes on average). Then locate previously encoded data where the intersection of the checksums from that reference data and the checksums of the data to encode is sufficiently large. Obviously the decoder will need the same reference data. Then pre-prime your encoder and decoder with this 'similar' data.
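    A toy sketch of the anchor-selection step (the rolling hash, the constants, and the function names below are purely illustrative, not taken from any particular engine):

    Code:
    # Slide a rolling hash over a block, keep hashes where hash % N == 0,
    # and score candidate reference blocks by the size of the intersection
    # of their anchor sets with that of the data to encode.
    WINDOW = 48            # bytes covered by the rolling hash
    N = 64                 # keep roughly one anchor per 64 bytes on average
    BASE = 257
    MOD = (1 << 31) - 1

    def anchors(block: bytes) -> set:
        """Set of selected rolling-hash values for one block."""
        if len(block) < WINDOW:
            return set()
        top = pow(BASE, WINDOW - 1, MOD)
        h = 0
        for b in block[:WINDOW]:
            h = (h * BASE + b) % MOD
        picked = {h} if h % N == 0 else set()
        for i in range(WINDOW, len(block)):
            h = ((h - block[i - WINDOW] * top) * BASE + block[i]) % MOD
            if h % N == 0:
                picked.add(h)
        return picked

    def best_reference(target: bytes, references: list) -> int:
        """Index of the reference block sharing the most anchors with target."""
        t = anchors(target)
        scores = [len(t & anchors(r)) for r in references]
        return max(range(len(references)), key=scores.__getitem__)

    In practice you'd index the anchors of the reference pool up front rather than rehashing every candidate per query, but the selection and intersection steps are the core of it.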

    The technique is amazingly strong and can find similar data even when the pool of potential reference data blocks is extremely large.

  3. The Following User Says Thank You to michael maniscalco For This Useful Post:

    SolidComp (11th April 2018)
