
Thread: Seekable compression

  1. #1 by Intensity (Member, Stanford, California)

    Seekable compression

    How might a compressor be tweaked or modified so that the inner data is more practically seekable, without strongly impacting ratio and performance? I'd like to focus on compression technologies which are asymmetric: those which decompress very quickly yet have a good compression ratio (and potentially low memory usage), even if extra time is needed to achieve a good compression result. An example family is LZMA. As I understand it, LZMA (deriving from LZ77) is based on a sliding-window dictionary, which I understand to be the more common and efficient approach compared to an explicit phrase dictionary (the approach taken by the LZ78, rather than LZ77, family of compressors). A basic principle for achieving greater seekability of the stream is to use a dictionary approach: always read the dictionary at the start of an independent block, but when a particular chunk of data within that block is needed, use that (somewhat fixed) dictionary to decode starting at a nearby offset.

    With that in mind, I wonder which of the more common high-performing static-dictionary-based encoders meet this fast-decompression requirement. LZJ seems to be one such algorithm, but other entries I note, like LZMW, appear to incorporate an evolving or dynamically changing dictionary, if I understand this historical description correctly:

    http://www.ieeeghn.org/wiki/index.ph...ary_Algorithms

    I also note a research effort to achieve LZ77-level efficiency with fast random access, here:

    http://www.dcc.uchile.cl/~gnavarro/ps/dcc10.1.pdf

    Is anyone aware of a follow-up to those, such as a modern dictionary-based approach, or of more general research on seekable compression? Or perhaps of source code implementing that research example? I note that there are a number of entries for dictionary-based algorithms given here:

    http://en.wikibooks.org/wiki/Data_Co...ry_compression

    However these may be classified formally (dictionary-based, LZ78-derived, etc.), the property I'm looking for is whether greater seekability is possible with the algorithm. For that to be possible, I would expect the dictionary to be fixed, or at least derivable from a strict subset of the data (rather than from all of it, or half of it on average). As an example, perhaps the dictionary, index, and metadata can be read first (say the first 2% to 8% of a 16MB compressed block), and then, at the random offset requested, the desired sub-block can be read using the dictionary as a kind of codebook to decompress that specific piece. If a compression algorithm was originally derived from a dictionary approach but now requires that all blocks be processed, then it isn't in line with an optimally quick method for random access. And although I imagine the basic concept of a "codebook" or dictionary followed by compressed blocks, maybe there are other high-performing families of compressors which are seekable. As I suggest below, seekability might also help increase parallel decompression speed.
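    To make that layout concrete, here is a rough sketch in Python, using zlib's preset-dictionary support (zdict) as a stand-in for an LZ77-family codec; the 64KB chunk size, the toy build_dictionary heuristic, and the container layout are only assumptions for illustration, not an existing format. Seeking then touches only the shared dictionary, the small index, and one chunk, rather than the whole 16MB block.

    Code:
    import zlib
    from collections import Counter

    CHUNK = 64 * 1024          # assumed granularity of random access

    def build_dictionary(data: bytes, size: int = 16 * 1024) -> bytes:
        """Toy 'codebook': concatenate the most frequent 32-byte substrings.
        A real system would use a proper dictionary trainer."""
        counts = Counter(data[i:i + 32] for i in range(0, len(data) - 32, 32))
        return b"".join(s for s, _ in counts.most_common(size // 32))[:size]

    def compress_block(data: bytes):
        """One large block becomes: shared dictionary + an index of compressed
        chunk offsets + independently deflated chunks."""
        dictionary = build_dictionary(data)
        chunks, index, pos = [], [], 0
        for i in range(0, len(data), CHUNK):
            c = zlib.compressobj(9, zlib.DEFLATED, 15, 9,
                                 zlib.Z_DEFAULT_STRATEGY, dictionary)
            piece = c.compress(data[i:i + CHUNK]) + c.flush()
            index.append(pos)
            pos += len(piece)
            chunks.append(piece)
        return dictionary, index, b"".join(chunks)

    def read_at(dictionary, index, payload, offset: int) -> bytes:
        """Random access: load the shared dictionary and the small index,
        then decode only the chunk containing `offset`."""
        n = offset // CHUNK
        start = index[n]
        end = index[n + 1] if n + 1 < len(index) else len(payload)
        d = zlib.decompressobj(15, zdict=dictionary)
        chunk = d.decompress(payload[start:end]) + d.flush()
        return chunk[offset % CHUNK:]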

    Also, I wonder whether an existing sliding-window dictionary algorithm could be adapted so that fewer blocks need to be decompressed to reach a particular random offset. A first-order approximation is to use an initial preset/seed dictionary as input to the compressor: a concept possible in principle but not yet built into mainstream XZ, for example. That would at least help with the case where a pattern in the first 8MB of a 16MB data set also appears in the second 8MB half; simply reducing the algorithm's block size to less than 4MB would mean that redundancy is never seen. Perhaps the entire 16MB block (as an example) can be analysed, the most common strings and patterns separated out, and a dictionary formed from them to serve as the starting point for the rest of the block. Although the most efficient coding of the input traditionally assumes that the dictionary keeps adapting as the input is processed serially (one primary evolving dictionary), what if the sliding-window dictionary state forked itself into several states (only one used when decompressing a needle within a haystack, but one or two used in compressing, with rollback)? I won't go into full detail, since I am only speculating and brainstorming, but the special case to look into (beyond the trade-off I have in mind, which is gaining seekability while keeping the other efficiency parameters relatively constant) is the expected use case of finding a needle (say, 4 kilobytes) within a haystack (16MB) while seeking over the data, even at the cost of sequential performance. I just wanted to throw that out there, since it might suggest another approach.

    I'll suggest a step further: a second-order approximation, which is not only a fixed preset/seed dictionary at the very start, but a vector of locally adapted modifications to that original seed dictionary. Suppose we have a 16MB block of uncompressed data, and some initial 2% to 6% of the compressed result stores a static fixed dictionary (one that takes into account the most probable strings in that 16MB set). Rather than letting that one seed be used throughout, we consider every 2MB region and allow an incremental local adaptation to that 2MB subset of the 16MB data. Even if it isn't an entirely new dictionary (perhaps some kind of incremental update), we analyse that 2MB region, see which patterns are most common, and use the result as a refined local seed dictionary for that entire 2MB. Presumably the updated dictionary would take less storage to encode than the original seed dictionary (it might even be actual data from that region rather than a separate dictionary block). For decoding, the seed dictionary (at the very start of the 16MB) would be needed, and then, to reach an offset within a particular 2MB region, that region's dictionary refinement would need to be read. So instead of reading all of the data, only a small subset would be read. This is a second-order approximation, but of course an arbitrary hierarchy can be formed. The basic idea is to exploit locality while retaining seekability, since only a logarithmic or very small subset of the data would need to be read (a rough sketch of this follows below).

    Locality here would mean not only all preceding blocks (with emphasis on the latest), but a primary emphasis on the last few blocks, with a kind of hierarchical efficiency derived from progressively and incrementally updating the dictionary. Maybe it's just a matter of determining empirically for which typical data this local modification is sufficient to achieve good compression. But if this kind of optimisation is possible (or, considering seekability in compression in general), it could also improve the parallelism of decompression, since a given block would have fewer intermediate dependencies before being decompressed. Also, would it be too costly to modify the "sliding window" to keep some global perspective on which patterns and strings are best kept in the dictionary (an LRU or ARC policy rather than FIFO)? Or maybe there can be some other lightweight metadata in that dictionary. Statistics could also be kept on the dictionary itself as it is updated and evolved, since only a portion of it is actively used; as compression and decompression proceed, one could note the earliest time a particular subset of the dictionary can be freed because it will not be referred to again.
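    As a rough sketch of that second-order idea, again leaning on zlib's preset dictionaries: here the "local refinement" is guessed to be simply a sample of the region appended to the shared seed, which is only one of many possible incremental updates, and the sizes are arbitrary. Decoding a needle inside one region then needs the seed, that region's small refinement record, and the region's compressed bytes, but not the rest of the 16MB block.

    Code:
    import zlib

    def refine(seed: bytes, region: bytes, extra: int = 16 * 1024):
        """Toy 'local refinement': a small sample of the region that gets
        appended to the shared seed.  Only `sample` is stored per region;
        a real refiner would pick the region's most frequent strings."""
        sample = region[:extra]
        local = (seed + sample)[-32 * 1024:]   # deflate uses at most the last 32KB of a dictionary
        return sample, local

    def compress_region(seed: bytes, region: bytes):
        sample, local = refine(seed, region)
        c = zlib.compressobj(9, zlib.DEFLATED, 15, 9,
                             zlib.Z_DEFAULT_STRATEGY, local)
        return sample, c.compress(region) + c.flush()

    def decompress_region(seed: bytes, sample: bytes, blob: bytes) -> bytes:
        """Rebuild the locally refined dictionary from the shared seed plus
        the stored per-region record, then decode just this region."""
        local = (seed + sample)[-32 * 1024:]
        d = zlib.decompressobj(15, zdict=local)
        return d.decompress(blob) + d.flush()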

    I'm also wondering when it might be acceptable to drop in an imperfect (but close) representation of a block's dictionary and statistics. I'll illustrate with Huffman coding. Suppose we've built codewords (and the accompanying code lengths) for a particular data set by studying it. Now suppose we come across a data set (or block) with similar characteristics. Presumably we could use the same codebook to encode that similar set; although there is a small chance the compressed size would increase (in which case there could be a fallback algorithm), for the most part the efficiency should fall only slightly short of optimal. That lets us save the cost of storing multiple similar dictionaries, so other gains could be found. It would work provided we can decide which blocks of data are similar enough to others (maybe by ranking the most common symbols and patterns after some kind of normalisation). During the compression phase, a dictionary could also be described in terms of another via an incremental delta tool such as bsdiff. This could lead to a set of dictionaries which are incrementally refined, but it is easier to implement if an imperfect dictionary can be "dropped in" even though it may not best represent the data. Keeping with the asymmetric assumption, we could afford to try a variety of heuristics and combinations of dictionary evolution, so that in the output the best path to a local dictionary is coded in a few bits even though finding that optimum took a fair amount of time. Finally, a complementary way to gain efficiency is to preprocess or filter the input, or even de-duplicate it to some extent, drawing on ideas from lrzip (with a seekable emphasis).
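    One way to make the "similar enough to reuse" decision concrete is to compare a block's cost under its own optimal code with its cost under the reference block's statistics, and only fall back to building a fresh codebook when the penalty is too large. Below is a byte-level sketch; a real coder would apply this to its match/literal symbols, and the 2% tolerance and escape floor are arbitrary assumptions.

    Code:
    import math
    from collections import Counter

    def distribution(block: bytes) -> dict:
        counts = Counter(block)
        total = len(block)
        return {sym: c / total for sym, c in counts.items()}

    def cost_bits(block: bytes, model: dict, floor: float = 1e-6) -> float:
        """Ideal code length of `block` in bits under `model`; symbols the
        model has never seen are charged a small floor (escape) probability."""
        p = distribution(block)
        return len(block) * sum(q * -math.log2(model.get(sym, floor))
                                for sym, q in p.items())

    def should_reuse(reference: bytes, new_block: bytes,
                     tolerance: float = 0.02) -> bool:
        """Reuse the reference codebook if the size penalty versus an optimal
        per-block code stays under `tolerance` (here 2%); otherwise fall back
        and store a fresh codebook for this block."""
        own = cost_bits(new_block, distribution(new_block))
        borrowed = cost_bits(new_block, distribution(reference))
        return borrowed <= own * (1.0 + tolerance)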

    That said, there are other ways I can imagine to make seeking within a compressed scheme practical (potentially by finding the right algorithm or adapting an existing one). For example, the memory needed to decompress a particular portion of the data could be made proportional to the size of the requested subset (say a 32KB to 1MB region), rather than to the size of the entire independently compressed block (16MB throughout this discussion). Basically, in order to decompress just 64K from a compressed 16MB block, which compression algorithms (especially those with asymmetric performance) could use a much smaller footprint when it is known that only a subset of the result is needed? When there is a fixed dictionary size and the dictionary is meant to evolve serially as the data is processed, it seems the minimum requirement to decompress any portion of the data is at least one instance of that dictionary. But maybe there are tweaks, optimisations, or other ways of thinking about it.
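    For a plain zlib/DEFLATE stream this can already be approximated with the incremental decompressor: feed the compressed bytes in pieces and throw away output that falls before the requested range, so peak memory is roughly one piece plus the codec's 32KB window rather than the whole 16MB result. A sketch follows; note it still spends CPU decoding the discarded prefix, which is exactly the cost a truly seekable format would avoid.

    Code:
    import zlib

    def read_slice(compressed: bytes, start: int, length: int,
                   step: int = 64 * 1024) -> bytes:
        """Return uncompressed bytes [start, start+length) of a zlib stream
        without ever holding the whole decompressed block in memory."""
        d = zlib.decompressobj()
        out = bytearray()
        produced = 0                       # uncompressed bytes decoded so far
        for pos in range(0, len(compressed), step):
            piece = d.decompress(compressed[pos:pos + step])
            lo = max(start - produced, 0)  # overlap of this piece with the request
            hi = min(start + length - produced, len(piece))
            if hi > lo:
                out += piece[lo:hi]
            produced += len(piece)
            if produced >= start + length:
                break                      # nothing after the range gets decoded
        return bytes(out)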

    Another constraint might be access time. Though this assumption need not apply to the entire discussion, suppose for a moment that a 128MB (original uncompressed size) block is compressed and we'd like to get at a 1MB to 8MB subset of it. It is one thing to assume the data is already in memory, but suppose not: the 96MB (compressed size) is on disk or even SSD, and there is an overhead in fetching that whole 96MB into memory. In that case, with a transparent enough seekable format, perhaps an initial 4MB of metadata and index can be read, followed by a remaining 0.5MB to 5MB of compressed data, to reach our target; that alone saves access time. Likewise, a fast enough decompression algorithm contributes to the practicality of seekable decompression, for the following reason. Suppose decompressing a 16MB block requires 20MB of memory at its peak (including the full decompressed result, a requirement which, as I suggest above, could itself be improved upon). Then, as long as decompression happens quickly enough, more decompression operations can be packed into a given time unit, effectively reducing average memory use over time.

    It's great to have practical, efficiently compressed read-only filesystems such as squashfs (or nbdkit with an xz plugin), which can use a fairly large block size to achieve a good ratio. But in principle, I'm wondering: what else can be done to make compression algorithms more seekable without sacrificing ratio?

  2. Thanks:

    Nania Francesco (13th August 2013)

  3. #2 by just a worm (Member, planet "earth")
    Searching for an (uncompressed) string in compressed data is quite a problem in general.

    As you already partially suggested, using blocks is one way to reduce the need to decompress the whole data: you can stop as soon as you find the first occurrence. And if you use a streaming compression method whose decompression is independent of the following data, then you can also skip decompressing everything that comes after that first occurrence.

    I haven't looked into the dictionary-based algorithms too deeply because the patents deterred me. But I have seen that quite some research has been done on searching for strings in data that has been transformed with the Burrows-Wheeler transform. As you probably already know, though, a compression method based on the Burrows-Wheeler transform is somewhat incompatible with a dictionary-based method.

    Anyway, all that you would save is the inverse transform. You would still have to undo all the post-transform steps before your program gets a clue what the content of your compressed data is. Not undoing the transform but searching within the still-transformed data probably saves you a little CPU time, but it is quite a complicated task.

    Saving some CPU time is a nice thing. But don't forget that searching within the compressed data could take longer than doing a complete decompression and then searching within the decompressed data.

    In my personal opinion it is not worth spending the development resources to make it possible to work with data which is still compressed or only partially decompressed.

  4. #3 by Bulat Ziganshin (Programmer, Uzbekistan)
    Twenty years ago I developed a PPM compressor with static tables. It may be a good candidate for compressing text data.

  5. #4 by thorfdbg (Member, Stuttgart)
    Quote Originally Posted by Intensity
    How might a compressor be tweaked or modified so that the inner data is more practically seekable, without strongly impacting ratio and performance? I'd like to focus on compression technologies which are asymmetric: those which decompress very quickly yet have a good compression ratio (and potentially low memory usage), even if extra time is needed to achieve a good compression result.
    Every now and then such a topic shows up at the DCC (Data Compression Conference), and you might find something about it on IEEE Xplore. I believe the right keyword is "Wavelet compression". Note that this is *not* the wavelet transformation you might know from imaging, but a tree type of compression that allows seeking within limits. It bears some resemblance to the linear wavelet transformation, but it is a lossless data compression technique and not a linear transformation.

  6. Thanks:

    Intensity (2nd August 2013)

  7. #5 by a Member from San Diego
    There are at least two problems that have to be solved to have efficiently seekable compression. One is that you have to create some kind of index structure so that you can translate from positions in the uncompressed data stream to positions in the compressed stream. With lossless compression, there's no way to efficiently predict how far into the stream a particular piece of data will end up once it's been compressed, because it depends either on the compressibility of everything leading up to it in the original stream, or on where that data maps to in the BWT (which depends on what's in the rest of the stream). So you can't extract the information in any efficient way from the stream itself. The index basically amounts to taking samples at regular intervals and including them with the compressed output. Higher sampling rates enable faster seeking, but add more space.
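    As a sketch of that index: record, every N uncompressed bytes, the corresponding offset in the compressed stream (plus whatever restart state the codec needs, which is omitted here), and answer a seek with a division and a short skip. The interval N is the space-versus-seek-speed knob; the class and names are just illustrative.

    Code:
    class SeekIndex:
        """Maps uncompressed positions to sampled compressed-stream offsets."""
        def __init__(self, interval: int):
            self.interval = interval   # uncompressed bytes between samples
            self.samples = []          # samples[k] = compressed offset of
                                       # uncompressed position k * interval

        def add_sample(self, compressed_offset: int):
            self.samples.append(compressed_offset)

        def locate(self, uncompressed_offset: int):
            """Return (compressed offset to restart from, uncompressed bytes
            to skip after restarting).  A finer interval means less skipping
            but a larger index."""
            k = min(uncompressed_offset // self.interval, len(self.samples) - 1)
            return self.samples[k], uncompressed_offset - k * self.interval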

    The other problem to solve is to make sure you are able to efficiently reconstruct the content of the data once you've located it. There will generally be a trade-off between the efficiency with which you can restart at random locations in the compressed stream and the size of the compressed data, because having fast restart capability limits the amount of information that can be used in compression (all the information that affects the compressed stream that isn't present in the stream itself at that location will need to be tracked down).

    You'll have to address those two issues no matter what type of underlying compression algorithm you use, and since the solutions to each of them are at odds with achieving the highest possible compression rates, no efficient compression algorithm will naturally have these kinds of features built-in. You could in principle take any algorithm, break its input and output into separately-compressed blocks, and make it efficient to seek to the start of any block.

    The work by Navarro is related to compressed self-indexes. Searching for "compressed self-index" should point you toward lots of good information.

  8. #6 by Intensity (Member, Stanford, California)
    Quote Originally Posted by thorfdbg
    Every now and then such a topic shows up at the DCC (Data Compression Conference), and you might find something about it on IEEE Xplore. I believe the right keyword is "Wavelet compression". Note that this is *not* the wavelet transformation you might know from imaging, but a tree type of compression that allows seeking within limits. It bears some resemblance to the linear wavelet transformation, but it is a lossless data compression technique and not a linear transformation.
    Cool, thanks for the reference. I've searched pretty exhaustively for that in the archives but I've not found anything. Do you have further details or know of a specific paper which mentions this concept? From there I can check for other papers which point to it.

  9. #7 by Intensity (Member, Stanford, California)
    Quote Originally Posted by Bulat Ziganshin
    Twenty years ago I developed a PPM compressor with static tables. It may be a good candidate for compressing text data.
    Cool, I believe I've seen that source. Would this be the CM.CPP that starts with "CM is the static context modeling archiver"? What would the overall impact of having static tables be? Can the size of the tables be adjusted, and would the data be addressable by offset? I wonder how much overhead this approach has compared to other CM algorithms.

  10. #8 by a Member from San Diego
    Quote Originally Posted by Intensity
    Cool, thanks for the reference. I've searched pretty exhaustively for that in the archives but I've not found anything. Do you have further details or know of a specific paper which mentions this concept? From there I can check for other papers which point to it.
    I think he's talking about wavelet trees. The Wikipedia page has references to papers.
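    For reference, here is a minimal wavelet tree supporting rank (how many times a symbol occurs in a prefix) in O(log sigma) bitmap steps. This is only a didactic sketch: real implementations replace the plain prefix-sum arrays with succinct rank/select bitmaps and add select and access operations.

    Code:
    class WaveletTree:
        """Minimal wavelet tree over a sequence of symbols."""
        def __init__(self, s, alphabet=None):
            self.alphabet = sorted(set(s)) if alphabet is None else alphabet
            if len(self.alphabet) <= 1:          # leaf: run of a single symbol
                self.bits = None
                return
            mid = len(self.alphabet) // 2
            left_syms = set(self.alphabet[:mid])
            self.bits = [0 if c in left_syms else 1 for c in s]
            self.ones = [0]                      # prefix sums of the bitmap
            for b in self.bits:
                self.ones.append(self.ones[-1] + b)
            self.left = WaveletTree([c for c in s if c in left_syms],
                                    self.alphabet[:mid])
            self.right = WaveletTree([c for c in s if c not in left_syms],
                                     self.alphabet[mid:])

        def rank(self, c, i):
            """Number of occurrences of symbol c in the first i positions."""
            if self.bits is None:
                return i if self.alphabet and self.alphabet[0] == c else 0
            mid = len(self.alphabet) // 2
            if c in self.alphabet[:mid]:
                return self.left.rank(c, i - self.ones[i])   # zeros before i
            return self.right.rank(c, self.ones[i])          # ones before i

    # Example: count 's' symbols among the first 5 characters of "mississippi".
    wt = WaveletTree("mississippi")
    assert wt.rank('s', 5) == 2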

    You might also want to look at the Re-pair algorithm for compression. Here's an implementation with references. The idea is to separate the text into a dictionary that's structured as a forest of binary trees and a sequence of references, which are encoded separately.
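    To give a flavour of it, here is a deliberately naive sketch of Re-pair's core loop (this is not the implementation referred to above): repeatedly replace the most frequent adjacent pair of symbols with a new symbol and record the rule, leaving a small grammar of binary rules plus a shortened sequence. Real implementations do this in linear time with priority queues and count pair frequencies more carefully; this version simply rescans the sequence on each pass.

    Code:
    from collections import Counter

    def repair(text: bytes):
        """Naive Re-pair: returns (rules, sequence).  Each rule maps a new
        symbol id to the pair it replaces; expanding rules recursively
        reconstructs the original text."""
        seq = list(text)
        rules = {}
        next_sym = 256                           # ids >= 256 are nonterminals
        while True:
            pairs = Counter(zip(seq, seq[1:]))   # note: counts overlapping pairs too
            if not pairs:
                break
            pair, freq = pairs.most_common(1)[0]
            if freq < 2:                         # no pair repeats: stop
                break
            rules[next_sym] = pair
            out, i = [], 0
            while i < len(seq):                  # replace non-overlapping occurrences
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                    out.append(next_sym)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            seq = out
            next_sym += 1
        return rules, seq

    def expand(sym, rules, out):
        if sym in rules:
            a, b = rules[sym]
            expand(a, rules, out)
            expand(b, rules, out)
        else:
            out.append(sym)

    # Round trip on a small example.
    rules, seq = repair(b"abracadabra abracadabra")
    decoded = []
    for s in seq:
        expand(s, rules, decoded)
    assert bytes(decoded) == b"abracadabra abracadabra"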

  11. Thanks:

    Intensity (2nd September 2013)
