
Thread: hybrid compression using multiple (non-Markov) macro-contexts?

  1. #1
    Join Date
    Oct 2013
    Filling a much-needed gap in the literature
    Thanked 49 Times in 35 Posts

    hybrid compression using multiple (non-Markov) macro-contexts?

    I'm looking for work (or just search terms) on hybrid (lossless) compression using multiple techniques or dictionaries based on segmenting a file into similar-looking hunks, and switching back and forth between different compressors for different apparent categories of data.

    Here's a motivating example: I was looking at a test file of Bulat's that's basically just a bunch of DLL's concatenated together, and it's (unsurprisingly) clear from a hexdump that it's got batches of different kinds of data interleaved in it, at a scale of anywhere from dozens to thousands of bytes: runs of machine code, runs of 16-bit Unicode, runs of mostly 8-bit ASCII null-terminated strings, and runs of data containing integers or pointers aligned on some larger power-of-two boundaries.

    That's unsurprising, because you'd expect each DLL to have several different segments: at least a segment of actual code, a segment containing program literals (which may or may not include substantial text strings), and a link table consisting mostly of text strings (often in a different encoding). Depending on how those segments get combined by mostly static or dynamic linkage into larger library DLL's, similar segments from different object files may get concatenated together or left interleaved, and so on. (Some of these things are driven by compiler/linker segregation of stuff into pages to be mapped executable, read-only, or writable, and others are due to regularities in how people program and how compilers cash that out.)

    Even within a single object file from a single source file, you often get similar segmentation, with machine code, read-only literals, initialization literals for writable data, debugging info tables, aligned struct literals separate from text literals, larger-aligned stuff preceding stuff with smaller alignment constraints, etc.

    These general kinds of macro-scale regularities are very common in lots of other types of data, including in-memory data and "pickled" data structures (like Java "serialized" files). Memory allocators typically allocate similar objects in a given data structure more-or-less together in memory, one way or another. They're also common in various file formats for various applications, from word processors to 3D games, and often at several levels.

    That makes me think that the first thing a compressor should do (conceptually) is to segment the data into reasonable-sized blobs of dozens to hundreds of bytes, and construct some stereotypes of the different kinds of chunks it sees---using different dictionaries or frequency tables or whatever, if not choosing different algorithms. Overly concretely, this could be "stuff that looks like code," "stuff that looks like structs," "stuff that looks like arrays of word-aligned ints or pointers," "stuff that looks like arrays of structs," and so on. Then it would try compressing the different categories as more-or-less separate streams with potential crosstalk. (E.g., when you see a few zeroes-followed-by-common-Latin1-letters, you up the probabilities of common unicode characters, but don't assume that's all you're likely to see. It might be a link table with alternating strings and pointers of some sort.)
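
    A rough sketch of that segment-and-stereotype idea (all feature choices, quantization steps, and names here are invented for illustration, not taken from any real compressor):

```python
# Chop a file into fixed-size chunks and give each a crude, quantized
# statistical fingerprint; runs of equal fingerprints are candidate
# macro-segments to hand to different models. Features are illustrative.

def chunk_fingerprint(chunk: bytes) -> tuple:
    """Return a coarse, quantized feature vector for one chunk."""
    n = len(chunk) or 1
    zeros = chunk.count(0) / n                         # dense zeros: structs, wide ints
    printable = sum(32 <= b < 127 for b in chunk) / n  # ASCII-ish text
    hi_bit = sum(b >= 128 for b in chunk) / n          # binary / machine-code-ish
    q = lambda x: round(x * 4) / 4                     # quantize so similar chunks collide
    return (q(zeros), q(printable), q(hi_bit))

def segment(data: bytes, chunk_size: int = 128):
    """Yield (offset, fingerprint) for each chunk of the input."""
    for off in range(0, len(data), chunk_size):
        yield off, chunk_fingerprint(data[off:off + chunk_size])
```

A real version would feed these fingerprints to a clustering step rather than comparing them for exact equality.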

    This seems like an obvious idea that a zillion people must have had, that probably has a name or a few associated keywords I don't know. Intuitively, I'd call it "context", where you have "contexts" like "ascii-ish stuff," "code-ish stuff" "16-bit unicode-ish stuff" and so on, but "context" in compression lingo almost always means Markov-ish literal strings of preceding atomic symbols, and this isn't that.

    I would think that somebody's done a lot of work on how to bottom-up classify blobs of similar-looking data within files to enable this sort of compression, conceptually similar to blob detection and foreground/background detection in image processing...

    Can anybody point me to literature on this, or just terms of art? I know that there are related things (like some zpaq models' ability to model stride contexts) but I don't know the technical names for them, or terms for generalized treatments of how to detect and exploit them.

  2. #2
    Join Date
    Feb 2013
    San Diego
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by Paul W.
    This seems like an obvious idea that a zillion people must have had, that probably has a name or a few associated keywords I don't know. Intuitively, I'd call it "context", where you have "contexts" like "ascii-ish stuff," "code-ish stuff" "16-bit unicode-ish stuff" and so on, but "context" in compression lingo almost always means Markov-ish literal strings of preceding atomic symbols, and this isn't that.
    Could you elaborate more on how that context differs from ordinary context? It sounds like what you're envisioning is maybe a little bit slower-adapting and not necessarily connected to the current match, like a sort of background context or mode, or meta-context. I imagine that if anyone has implemented this, Matt has been somehow involved.

    Mixed-up data, like you described, is apparently lethal to the BWT. It would be nice to think of a solution that improved BWT compression efficiency without needing complex machine learning-style algorithms.

    I've read about classifiers that are based on 3-tuples or special hash functions. Come to think of it, what you're talking about sounds like clustering, which encompasses a broad variety of AI and statistical techniques used to identify groups or clusters in data, and has been applied to a wide variety of problems and gotten lots of research over many years in various different fields. It's usually applied to bigger datasets and bigger problems than simple file compression, but you could certainly look into it and try to borrow some of the techniques. Try including clustering in your search terms.

    PS. What you're describing is definitely a type of context. The term lacks a precise definition AFAIK.

    Now that I think about it some more, I recall having spent time thinking about the same, or similar, basic problem. For instance, in Lempel-Ziv you have references connecting phrases to earlier matches, and, taken together, you might imagine the file + references forming a graph with references as edges and file positions as vertices. You could estimate the similarity of a pair of remote regions of the file by determining how well-connected they are (assume now that there's an edge filled in between every match pair, even if the edge would be redundant in LZ). Out of the full set of edges, you could probably pick out quite a few subsets of edges that shared very similar or identical distances among them. I think one goal I was trying to work toward was an LZ algorithm that favored similar-distance references, rather than short references. I also happened to be thinking about the FFT at that time, and was trying to envision connections between regular files and periodic signals so that DSP algorithms could be applied. Repeated patterns occurring at similar distances are not unlike a waveform, with the distance between matches corresponding to the distance between wave peaks. One thing I hoped might have been out there to discover was a new transform analogous to the BWT, but putting more weight on locality of matches, like LZ.
    Last edited by nburns; 1st April 2014 at 16:05.

  3. #3
    Join Date
    Oct 2013
    Filling a much-needed gap in the literature
    Thanked 49 Times in 35 Posts

    context, qualitative segmenting/fingerprinting, clustering, splay tree codes, etc.


    Could you elaborate more on how that context differs from ordinary context? It sounds like what you're envisioning is maybe a little bit slower-adapting and not necessarily connected to the current match, like a sort of background context or mode, or meta-context.
    Yes, that's the kind of thing I'm talking about, if I understand you correctly. I mean "context" in a considerably broader sense than the current LZ match, and more or less independent of the Markov concept of a context, which is just a literal string of literal symbols, such that if you miss a single symbol, your context doesn't apply. You want something a bit broader and less "brittle," looking not just at the last 2 to 10 bytes on average, but dozens to hundreds. That larger kind of "context" is usually crudely and implicitly assumed by things like what basic kind of compression algorithm and parameters you choose---e.g., 8-bit text-oriented or not, and how quickly you forget dictionary or frequency information, often just by picking a block size.

    I do indeed want to be able to detect/exploit clustering of similar data---not just in a straightforward statistical sense, but using simple, fast stats to reveal qualitative features appropriate for choosing models that look for other regularities.


    When you look at various kinds of data in a hex dump that displays 16 bytes per line, with unsigned char and ASCII interpretations side-by-side, it's very often visually obvious roughly where the boundaries of qualitatively different batches of data are, and what some of the regularities in those data are, even if you can't tell what specific kind of data you're looking at.

    Two simple bellwethers are:

    1. if you have chunks with a lot of alphabetic characters (especially E T A I O N), it's likely Western text or something with text strings interleaved in it.
    2. if you have vertical or slanted stripes of zeroes, you're very probably looking at regularly structured data, with the spacing and slant a big clue to the stride. (Any struct with an integer field is likely to have a zero---or maybe 0xFF---in the high byte of every instance of that integer.) So for example, if you see alternating zeroes and text characters, you're probably looking at 16-bit Unicode, and if you have mostly zeroes with every fourth byte a character, you're likely looking at 32-bit Unicode. You often get the same kind of visible stripes with printable ASCII among nonprintable byte values, or vice versa, when you're looking at non-text data, just because the printable/nonprintable distinction acts as a kind of screwy hash that discriminates between certain numeric ranges and others. (Different hashes would show other qualitative segment boundaries, and you can get a bit of that effect within mostly-printing regions by finding patterns in alphabetic vs. numeric vs. punctuation vs. accented characters.)
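
    The zero-stripe bellwether is easy to mechanize. A toy sketch (the 60% agreement threshold is arbitrary):

```python
from collections import Counter
from typing import Optional

def likely_stride(chunk: bytes, byte_value: int = 0) -> Optional[int]:
    """Guess a stride from the spacing of one common byte value: take the
    dominant gap between consecutive occurrences, and require it to
    account for most (here, an arbitrary 60%) of the spacings."""
    pos = [i for i, b in enumerate(chunk) if b == byte_value]
    if len(pos) < 4:
        return None
    gaps = Counter(b - a for a, b in zip(pos, pos[1:]))
    gap, count = gaps.most_common(1)[0]
    return gap if count >= 0.6 * (len(pos) - 1) else None
```

On UTF-16LE ASCII text this reports a stride of 2 (the zero high bytes); on data where the probe value is rare or irregular, it reports nothing.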

    There are less visually obvious but mechanically straightforward things to look for, that don't depend on very concrete things like Latin-1 encodings. Any integer field of recurring structures is likely to have 0 for its high byte, and the high byte of any pointer to a heap-allocated object is likely to have some particular byte value that is not 0.

    A basic trick would be to

    1) look for unusually common byte values for the data you're looking at, and then
    2) look for very simple regularities in those, like strides, recencies, and numeric similarity.

    Unless the data are encrypted or compressed, those almost always exist, and I think they should usually be exploitable one way or another.

    If you don't find obvious skew in the byte values, and the occurrences of the commonest byte values are not evidently clustered or strided, then you're probably looking at compressed or encrypted data.

    If you do find obvious skew in the byte values, but the common values are not obviously clustered or strided, you are looking at something like text or variable-length encodings for instructions (or maybe something like arbitrary precision integers or fractions, or weakly compressed data). You need to look for a better heuristic "parse" in some sense.
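
    That three-way triage---no skew, skew with strides, skew without strides---can be sketched with histogram entropy plus a crude stride check on the commonest value. The 7.5 and 0.5 thresholds below are picked by eye, not tuned on real data:

```python
import math
from collections import Counter

def byte_skew(chunk: bytes) -> float:
    """Shannon entropy of the byte histogram, in bits per byte.
    Near 8.0 means no skew (compressed/encrypted-looking)."""
    n = len(chunk)
    return -sum((c / n) * math.log2(c / n) for c in Counter(chunk).values())

def rough_class(chunk: bytes) -> str:
    """Very rough triage of one chunk, per the heuristics above."""
    if byte_skew(chunk) > 7.5:
        return "compressed-or-encrypted"
    # Skewed: does the commonest byte recur at one dominant regular gap?
    top = Counter(chunk).most_common(1)[0][0]
    pos = [i for i, b in enumerate(chunk) if b == top]
    gaps = Counter(b - a for a, b in zip(pos, pos[1:]))
    if gaps and gaps.most_common(1)[0][1] >= 0.5 * (len(pos) - 1):
        return "strided-records"
    return "text-or-code"
```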

    The high-level idea is that when you see these sorts of "bellwether" low-level regularities change, you can expect other stuff to change, too, and often in systematic ways because there was some higher-level qualitative change you don't understand. If your data has segments where the most common byte values or their strides or recencies are similar to each other, but different from the rest of the data, you can guess that other exploitable regularities are likely to be more similar between those segments than in the rest of the data.


    A too-concrete but fast version of this would just chop the data up into convenient-sized blocks (say 128 bytes), count how often it sees certain known simple bellwethers within each block, make a guess as to the basic kind of data among a few common types, and apply a specific algorithm, or a general algorithm tuned to favor that kind of data by applying certain biases to a dictionary & Huffman table. It might recognize a dozen specific categories like "8-bit Latin-1," "16-bit Latin-1," "x86 machine code," "ARM machine code," "Java VM code," "Python VM code," "heap-like data," "multibyte pixel image data," "audio stream data," and a few others found to be common, and otherwise default to whatever else it would have done, or "incompressible."
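
    The block-labeling half of that is tiny to sketch: chop into 128-byte blocks, label each with a toy classifier, and merge adjacent same-label blocks into segments that could each be handed to a different codec. The labels and thresholds here are made up:

```python
def crude_label(block: bytes) -> str:
    """Toy stand-in for a real per-block classifier."""
    n = len(block) or 1
    if sum(32 <= b < 127 for b in block) / n > 0.85:
        return "8-bit text"
    if block.count(0) / n > 0.3:
        return "strided/struct-like"
    return "other"

def segment_runs(data: bytes, block_size: int = 128):
    """Label fixed-size blocks and merge adjacent equal labels into
    (start, end, label) segments."""
    segs = []
    for off in range(0, len(data), block_size):
        lab = crude_label(data[off:off + block_size])
        if segs and segs[-1][2] == lab:
            segs[-1] = (segs[-1][0], off + block_size, lab)
        else:
            segs.append((off, off + block_size, lab))
    if segs:  # clip the final segment to the real length
        s, e, lab = segs[-1]
        segs[-1] = (s, min(e, len(data)), lab)
    return segs
```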

    A stronger, slower algorithm would not be limited to known types, and would learn new categories in a more subtle way, by following its initial segmentation with clustering and profiling between similar clusters. So if it's never seen Javascript bytecodes, or POWER RISC instructions, it'll still notice that the former has certain common byte values and is only byte-aligned, while the latter is 4-byte aligned with certain common high byte values, and learn something about other probabilities starting from those distinctions. (Likewise for learning different image or audio formats.)

    A basic underlying idea here is that there are only a few common ways of representing data on real hardware, and a modest handful of ways of combining them at the lowest levels of structure, that explain most of the exploitable regularities in data.

    For example, you usually don't really need to know all the different byte-aligned or word-aligned machine/VM code formats to do a pretty good job compressing most of them, if you have a moderate volume of that kind of data to compress. You don't have to know the probability skew of POWER instruction opcodes ahead of time, because you can learn that reasonably well on the fly if you just crudely parse it wordwise and hand it to a roughly appropriate algorithm that isn't thrown completely off by short Markov match lengths.

    Likewise, you don't really need to know the difference between 16-bit Latin-1 English and 16-bit Latin-1 German, if you have a moderate amount of that kind of data to compress. (Any good 16-bit "text compression" algorithm will do fine for either, quickly learning that "the " and "and " are common in one and "der " and "und " are common in the other.) What you most need to know is that you're looking at alphabetic text, with fairly stable word & letter frequencies, and what size the characters are---and you want to know when you've switched from one language or encoding to another. (As in an object file with mostly 8-bit English-ish program symbols in the link tables, but a table of literal strings in 16-bit German, for German users, or a database table dump with English metadata but German data.)

    --- overlong brain dump follows in case you're sufficiently interested... ---

    One of the things that motivates me to think and talk about this is looking at some interesting but old papers on splay-tree codes vs. Huffman codes, then seeing the recent thread here on benchmarking entropy coders.

    There are simple compression schemes for which splay tree codes fairly often work better than "optimal" Huffman codes, revealing another way in which "optimal" is a very iffy word. :-/ If Huffman codes were optimal in the everyday sense, they would consistently beat splay-tree coders---but they don't.

    There are many common kinds of data for which the assumptions underlying "optimality" proofs (for LZ, Huffman, etc.) are simply untrue---the simple regularities present can't be captured by any straightforward Markov model, and the frequencies are not stable enough in the right known way, even though there's a simple dominant pattern you can easily see in a hex dump or maybe a byte-match distance or MRU transform.

    For example, suppose you're compressing a 24-bit RGB image with something like LZH, Huffman-coding the LZ77 offsets. It will often chop the byte stream up wrongly and weirdly, because it doesn't know there's a 3-byte stride, or a larger stride at the raster size, and doesn't model the interaction between parsing and later implicit modeling & encoding. It will therefore give you a lot of matches of varying lengths around 3, but also around 6 and 9, and it just looks Zipfy or something. The weaker peaks at the raster interval will be broadened even more, because matches to recently-seen bytes within a raster tend to mask potential matches at the raster interval.

    Now suppose you use a really dumb compression algorithm that only matches single bytes, and entropy codes the offsets. It works surprisingly well---you get a lot of matches at exactly 3 bytes (one pixel) back, a moderate number at 6 and 9, and a similar but weaker and broader peak at around the raster interval, where the previous raster crossed the same object in the picture.
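
    That single-byte-matcher experiment is easy to reproduce on synthetic data. In this sketch (pixel values invented: a flat color run plus a gradient, 30 pixels of 3 bytes per raster), the distance histogram shows exactly the two peaks described, at one pixel back and one raster back:

```python
from collections import Counter

def backref_distances(data: bytes) -> Counter:
    """For each byte, record the distance back to the most recent
    occurrence of the same byte value (the 'dumb single-byte matcher')."""
    last = {}
    dists = Counter()
    for i, b in enumerate(data):
        if b in last:
            dists[i - last[b]] += 1
        last[b] = i
    return dists

# Synthetic 30-pixel-wide RGB 'image' (90 bytes per raster), 40 rasters.
flat = bytes([10, 200, 30]) * 10                                # a flat color run
grad = b"".join(bytes([x, 255 - x, 128]) for x in range(50, 70))  # a gradient
image = (flat + grad) * 40
d = backref_distances(image)
# d has strong peaks at distance 3 (one pixel back, in the flat run and the
# constant blue channel) and distance 90 (one raster back, in the gradient).
```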

    IIRC, a splay-tree code often works better than either LZH or just Huffman-coding single-byte matches, but in any event it works surprisingly well, because the splay tree is very twitchy and adjusts to keep the codes for very small offsets very short AND to keep the paths for intervals around the raster stride short---it "learns" the pixel stride as soon as it sees the first repeating byte in the first color channel, and "learns" the raster stride as soon as it makes its first match between rasters.

    Whether splay-tree coding will outperform "optimal" Huffman codes depends on things like how fast you decay your Huffman stats---whether your Huffman is comparably "twitchy" and fast-learning---compared to how big your images are and whether they're interleaved with other kinds of data, and on how you parse interleaved color channels as "strings." (Including other images of "the same kind" but with different raster sizes.)

    Splay tree codes often "forget too fast"---the first time they see a match at an unusual distance they rebalance that hunk of the tree and implicitly lengthen lots of nearby codes for common distances, undoing some of the useful, code-shortening imbalances in the tree. A handful of fairly random-distance matches can undo most of the tree-skewing you get from adapting to the raster size.

    (That suggests to me that just a little bit of hysteresis might go a long way to making them more tightly competitive with optimally-adaptive optimal Huffman codes, so that when they're worse they're less worse, but I'm not sure how best to do that.)


    It seems to me that splay tree codes are very interesting because they can implicitly encode something resembling the notion of "context" we're talking about, and that's another reason they sometimes outperform Huffman codes in poorly-thought-out experiments.

    Suppose you're compressing text, for example, and you have certain common words or phrases with relatively stable frequencies, like "the ", "and ", "if the ", etc., plus others that occur in more or less local, probabilistically clustered ways---if a given topic comes up, several strings from its cluster are likely to come up, and when you see one or two, it raises the odds of seeing each of the others. (E.g., if you've recently seen "constitution," "court" and "decision", you're unusually likely to see "overturned," "Justice," "opinion," etc. soon.)

    A splay tree can remember some of those kinds of regularities, if you use it just right. (But it's easy to use it a bit wrong, as I think people often do.)

    A splay tree has a very interesting "spatial locality"-exploiting property, in that items with adjacent keys move in correlated ways. When you rotate something you touch up to the root, you also tend to rotate the adjacent nodes in the tree about halfway to the root, the ones on either side of that about a quarter of the way, and so on. (Or something like that.)

    If you have that kind of locality in your keys, a splay tree will tend to outperform an "optimal" Huffman code. If you don't, a Huffman code will predictably tend to be better, assuming it adapts at about the right rate.

    That's something that a lot of people seem not to get. They think that a splay-tree coder is a drop-in replacement for an adaptive Huffman coder, but it's only so-so for that. The fact that it exploits spatial locality in similar keys being accessed at similar times means it's suboptimal if that regularity is not present---it's shortening codes of similar keys when it should not, and thus lengthening codes for dissimilar keys when it doesn't have to.
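
    A minimal top-down splay tree (in the style of Sleator and Tarjan's classic formulation) makes the "neighbors move about halfway up" behavior easy to observe. This is just a sketch to demonstrate the property, not a coder:

```python
class Node:
    __slots__ = ("key", "left", "right")
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def splay(root, key):
    """Top-down splay: bring `key` (or a neighbor of it) to the root,
    restructuring the whole search path as a side effect."""
    if root is None:
        return None
    header = Node(None)          # temporary holder for the broken-off subtrees
    l = r = header
    t = root
    while True:
        if key < t.key:
            if t.left is None:
                break
            if key < t.left.key:                 # zig-zig: rotate right
                y, t.left = t.left, t.left.right
                y.right, t = t, y
                if t.left is None:
                    break
            r.left, r, t = t, t, t.left          # link right
        elif key > t.key:
            if t.right is None:
                break
            if key > t.right.key:                # zig-zig: rotate left
                y, t.right = t.right, t.right.left
                y.left, t = t, y
                if t.right is None:
                    break
            l.right, l, t = t, t, t.right        # link left
        else:
            break
    l.right, r.left = t.left, t.right            # reassemble
    t.left, t.right = header.right, header.left
    return t

def insert(root, key):
    root = splay(root, key)
    if root is not None and root.key == key:
        return root
    node = Node(key)
    if root is not None:
        if key < root.key:
            node.left, node.right, root.left = root.left, root, None
        else:
            node.right, node.left, root.right = root.right, root, None
    return node

def depth(root, key):
    d = 0
    while root is not None and root.key != key:
        root = root.left if key < root.key else root.right
        d += 1
    return d

root = None
for k in range(256):          # sorted inserts leave a long spine
    root = insert(root, k)
before = depth(root, 101)     # deep before any access near it
root = splay(root, 100)       # one access to an adjacent key...
after = depth(root, 101)      # ...pulls 101 roughly halfway to the root
```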

    For images, you can profit from this (as above) by using relative offsets (distance back from scan) as keys. That exploits both "dimensions" of how splay trees adapt, shortening codes for recently seen offsets (by adjacent rasters) and things near them (by adjacent pixels).


    For normal text, which does not have that kind of spatial regularity, you'd want to do something else. You want to use the two-dimensional adaptivity of splay trees to adapt to two other dynamically-correlated-but-different "dimensions" of your input.

    I think of a splay tree as doing a kind of weighted 2D clustering, where similarity in the keys is used as a fixed weighting for clustering, and how recently you've seen a key gives you a very twitchy dynamic weighting in the other dimension.

    Suppose, for example, that you chop your text file up into strings in some fixed way, and use the absolute ordering of the first occurrence of those strings in the file as the keys to index your tree. (E.g., maintain a hash table, assigning new strings consecutive serial numbers, and maintaining a reverse mapping.)

    Choosing that ordering will ensure that things that are first seen together are near each other in the tree, which is usually a good thing for several reasons:

    1. Strings with stably high frequencies will usually show up early, and all be roughly together, so seeing a few common strings will shorten the codes for the other common strings.
    2. Strings that stably tend to occur near each other will also be roughly together, and the closer they occur to each other the first time, the closer they'll be. Less obviously...
    3. Strings that never occur near each other will not be close in key space, so shortening the code for one will not wrongly shorten the code for the other at the expense of lengthening codes worth shortening.
    4. Even if the first occurrences of some set of strings are not correlated with later-occurrences, the damage will at least be limited by the vertical sorting, if later accesses are fairly stably correlated.
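
    The key-assignment scheme described above is tiny to sketch (tokenization by whitespace split is just for illustration; any fixed chopping would do):

```python
def serial_keys(tokens):
    """Assign each distinct token a serial number in order of first
    occurrence; return (key sequence, reverse map) so keys can be
    decoded back to strings."""
    first_seen = {}
    keys = []
    for tok in tokens:
        if tok not in first_seen:
            first_seen[tok] = len(first_seen)
        keys.append(first_seen[tok])
    reverse = {v: k for k, v in first_seen.items()}
    return keys, reverse
```

The key sequence is what you'd index the splay tree with; tokens first seen together get adjacent keys.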

    For example, suppose that your file has a header segment in which the strings are atypical, followed by the bulk of the data, which has certain stable properties and normal spatial clustering. If the strings in those segments are disjoint, that's not a problem---you have that co-occurring set of strings clumped together where they do no harm---but if they overlap, and some common strings appear in that atypical section, that's a bit of a problem. When you get to the bulk of the data, the common strings from the first part will tend to rise to the top of the subtree indexing the first part, and have shortish codes, but they won't be as short as they should be, because of the spatial pollution by the weird strings near them in that non-representative initial section.

    Like the image-compression algorithm, this has some nice properties, and seems like a good idea for a simple algorithm with worst-case n log n amortized complexity, but something closer to linear when there's decent temporal and/or spatial locality. Like with the image compression algorithm, it will tend to exploit spatial co-occurrence of strings, even if that co-occurrence isn't strictly sequential, and do pretty well if it is sequential. (E.g., if one string usually immediately follows another, they'll usually have adjacent keys as well, so once you touch one, the code for the other will tend to be short, often just a couple of bits. In effect, it will notice strings that tend to occur together and treat them as longer strings.)

    One pretty fast linear-time way to reduce the clustering errors in key assignments would be to do a preliminary pass going backwards through the string sequence, counting occurrences and keeping track of first touches (meaning last touches, since you're going backwards), then use that information to avoid grouping things with very different frequency counts or times of last touch. That would at least help you avoid grouping a lot of things in key space that are frequently not touched together, weakening your key-spatial locality & lowering your compression. (You could use sampling to cut that down and probably get most of the same benefit at a fraction of the cost).
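
    Here is a sketch of that pre-pass idea, using just the frequency half of it (the power-of-two bucketing is an invented heuristic; a real version might also use last-touch positions from the backward pass, as described):

```python
from collections import Counter

def refined_keys(tokens):
    """Assign tree keys using a counting pre-pass: group tokens into
    coarse frequency buckets (powers of two), then order by first
    occurrence within each bucket, so rare strings don't land in key
    space between stably-common ones."""
    freq = Counter(tokens)                      # the pre-pass
    rank = {}
    for tok in tokens:                          # first-occurrence rank
        rank.setdefault(tok, len(rank))
    bucket = lambda t: -freq[t].bit_length()    # more frequent = earlier bucket
    ranked = sorted(rank, key=lambda t: (bucket(t), rank[t]))
    return {tok: i for i, tok in enumerate(ranked)}
```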

    With or without that, it should be pretty easy to construct preloaded dictionaries that work pretty well for several qualitatively different kinds of data sources, e.g., different human languages/dialects/styles/topics, or different programming languages, by lumping training data that are common across several sources ahead of source-specific training data segregated by which data source they're common for. Then when you use it for real, the splay tree will quickly raise up the subtrees for whatever language / dialect / blob-of-related-jargon-terms you actually use, and give it short codes, while unused ones drift rapidly down the tree.

    I imagine that if anyone has implemented this, Matt has been somehow involved.
    I was thinking that too, which is one reason I'm rambling about this here... it seems likely he's tried some of this or similar stuff, or knows of others' related work I should look at.

  4. #4
    Join Date
    Oct 2013
    Filling a much-needed gap in the literature
    Thanked 49 Times in 35 Posts
    OK, until just now I didn't realize just what "word reduction transform" really meant, and its significance.

    The xwrt preprocessor, as I understand it, finds the common strings (with more than a few occurrences, e.g., eight) and moves them to the front of the file. If those strings are stably common, that should reduce the "pollution" of clustered strings with unclustered strings, and make a coding that exploits spatial locality (similar offsets), like a splay tree coder, work better.

    If too many of those "common" strings are not stably common, though---if they're very common in some parts of data but not others---it won't work as well. (An example would be terms related to some subject that is discussed a lot in certain chapters or subchapters of a book, but rarely if at all in most others.)

    I wonder if the somewhat disappointing performance of splay tree codes for normal text compression is partly because of a failure to get this approximately right.
    Last edited by Paul W.; 6th April 2014 at 21:21. Reason: typo

