Results 1 to 8 of 8

Thread: Sequential read/write for a compression/decompression API?

  1. #1
    Member
    Join Date
    Nov 2019
    Location
    USA
    Posts
    2
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Sequential read/write for a compression/decompression API?

    The question: Are there known compression/decompression algorithms or implementations which allow or require out-of-order reads or writes?

    Why I'm asking: I'm giving some feedback on an API for compression and decompression of data. (We want the API to be flexible for future use with as-yet-unknown algorithms.) The API accepts callback functions for reading and writing data. If I can assume sequential reading and/or writing, I can eliminate the offset parameter from the read and/or write callbacks.

    Thanks.

  2. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,347
    Thanks
    212
    Thanked 1,011 Times in 536 Posts
    Well, there's this:
    Code:
    ZEXTERN z_off_t ZEXPORT gzseek OF((gzFile file, z_off_t offset, int whence));
    
         Sets the starting position for the next gzread or gzwrite on the given
       compressed file.  The offset represents a number of bytes in the
       uncompressed data stream.  The whence parameter is defined as in lseek(2);
       the value SEEK_END is not supported.
    
         If the file is opened for reading, this function is emulated but can be
       extremely slow.  If the file is opened for writing, only forward seeks are
       supported; gzseek then compresses a sequence of zeroes up to the new
       starting position.
    
         gzseek returns the resulting offset location as measured in bytes from
       the beginning of the uncompressed stream, or -1 in case of error, in
       particular if the file is opened for writing and the new starting position
       would be before the current position.
    But I think you can safely discard the offset, since there're no compression algorithms with random access anyway.
    I mean, there're compressed formats with random access, but these are implemented by independently compressing data blocks,
    so its not a part of actual compression algorithm.

    Although I suppose you can still provide API like that to the user, just with your own emulation layer
    (don't expect compression libraries to provide random access, just implement it in your wrapper).
    However it would likely require adding some simple container format for compressed data.

    Also, I'd like to note that i/o callbacks are very inconvenient for C++ programmers,
    because its usually necessary to define global wrappers that pass control to methods
    (if callbacks have support for an opaque pointer at all), and having standalone i/o methods
    is already inconvenient, because of (un)necessary preparation required to use them
    (local vars copied to class members, superclass pointers copied to class members...).

    I suppose a reasonable "native" C++ solution would be to accept lambdas as callbacks,
    but it would cause lots of compatibility/portability issues, and isn't especially efficient either.

    So good compression libraries tend to use a coroutine-like API, where some function is
    called multiple times for processing, and new buffers are passed to it depending on
    returned status values.
    https://github.com/facebook/zstd/blo...ib/zstd.h#L576

    In my codecs the API is based on actual coroutines, and looks like this:
    https://github.com/Shelwien/ppmd_sh/...er/pmd.cpp#L54

  3. #3
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,505
    Thanks
    741
    Thanked 665 Times in 359 Posts
    if you mean random access to uncompressed data, I would rather say that it's a sort of "archive" algorithm. Such algorithm are possible (f.e. we can keep dictionary or PPM stats used to decompress small messages), although I've not seen any implemented

    if you mean whether (de)compression algo may need to access data non-sequentially, the main example are algorithms with multiple output streams such as BCJ2. IMHO, it's better to add to r/w callback number of I/O stream and then perform mixing-and-seeking at the API implementation level

  4. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,347
    Thanks
    212
    Thanked 1,011 Times in 536 Posts
    Bulat is right, there're some known algorithms with multiple input and/or output streams (BCJ2, some recompression or preprocessing libs).
    And with multiple streams there's always an issue of stream interleaving.
    In particular, 7-zip i/o callbacks do include a Seek method: https://github.com/upx/upx-lzma-sdk/.../IStream.h#L41

    I still think that its better to avoid it (interleaving can be handled internally, eg. by flushing the codec when further buffering is impossible),
    but its true that random access would be necessary to implement .7z (de)compression, which can be considered a "known algorithm".

  5. #5
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,505
    Thanks
    741
    Thanked 665 Times in 359 Posts
    I still think that its better to avoid it (interleaving can be handled internally, eg. by flushing the codec when further buffering is impossible)
    It's not always possible. F.e. SREP decompression consumes matches info in the MATCH SOURCE order while compressor obviously produces them in the MATCH DESTINATION order

  6. #6
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,347
    Thanks
    212
    Thanked 1,011 Times in 536 Posts
    I guess srep is a good example, since there's always more storage than RAM, and dedup algorithm can use random access to compare or fetch data fragments.
    In this case multi-stream i/o is not a workaround, so it looks like a truly universal API would require both position and stream_id arguments.

  7. #7
    Member
    Join Date
    Nov 2019
    Location
    USA
    Posts
    2
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Thank you both for the feedback. I'll be discussing our options with my coworkers soon.

    Quote Originally Posted by Shelwien View Post
    Also, I'd like to note that i/o callbacks are very inconvenient for C++ programmers,
    because its usually necessary to define global wrappers that pass control to methods
    (if callbacks have support for an opaque pointer at all), and having standalone i/o methods
    is already inconvenient, because of (un)necessary preparation required to use them
    (local vars copied to class members, superclass pointers copied to class members...).
    We're doing OOP in C.

  8. #8
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,347
    Thanks
    212
    Thanked 1,011 Times in 536 Posts
    > We're doing OOP in C.

    That's actually worse, because you just lack syntax sugar?
    Just imagine having to setup a compression pipeline, eg. BCJ+LZMA, with BCJ and LZMA as standalone codec objects with their own callbacks.

    I also remembered another potential problem - when codec is MT, a callback can be called in context of a random thread.

Similar Threads

  1. Sequential implementation of ANS-like entropy coder
    By Shelwien in forum Data Compression
    Replies: 2
    Last Post: 2nd July 2017, 13:50
  2. Standard compression library API
    By Bulat Ziganshin in forum Data Compression
    Replies: 12
    Last Post: 2nd April 2017, 07:56
  3. Open Source Streaming API for Compression
    By harry in forum Data Compression
    Replies: 7
    Last Post: 30th September 2014, 08:28
  4. How To Write Unmaintainable Code
    By Piotr Tarsa in forum The Off-Topic Lounge
    Replies: 4
    Last Post: 27th September 2014, 01:00
  5. Standard for compression libraries API
    By Bulat Ziganshin in forum Data Compression
    Replies: 47
    Last Post: 30th March 2009, 07:10

Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •