
Thread: Compressing a collection of large bitmaps.

  1. #1
    Member
    Join Date
    Aug 2020
    Location
    Italy
    Posts
    5
    Thanks
    1
    Thanked 0 Times in 0 Posts

    Compressing a collection of large bitmaps.

    After some reductions I transformed my original problem into the problem of compressing a (large) collection of big (10e9 * 10e9) bitmaps (0-1 matrices).
    My goal is to compress such a collection and, if possible, be able to extract a selected bitmap from the compressed data without decompressing the whole sequence.
    Do you suggest any particular compressor to achieve this?
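    For illustration, a toy Python sketch of the "extract one bitmap without decompressing the rest" requirement, assuming each bitmap (or, more realistically at this size, each tile of one) can be passed in as a bytes object. Each frame is compressed independently, with zlib purely as a placeholder codec, and an offset index is appended so a single frame can be seeked to and decoded on its own; the function names are made up for the example.

    import struct
    import zlib

    def write_archive(path, bitmaps):
        """Compress each bitmap (given as bytes) independently and append
        an offset index so any single bitmap can be extracted later."""
        offsets = []
        with open(path, "wb") as f:
            for bm in bitmaps:
                offsets.append(f.tell())
                blob = zlib.compress(bm, 9)
                f.write(struct.pack("<Q", len(blob)))   # length prefix
                f.write(blob)
            index_pos = f.tell()
            for off in offsets:
                f.write(struct.pack("<Q", off))         # offset table
            f.write(struct.pack("<QQ", len(offsets), index_pos))  # trailer

    def read_bitmap(path, i):
        """Decompress only the i-th bitmap, using the trailer and offset table."""
        with open(path, "rb") as f:
            f.seek(-16, 2)
            count, index_pos = struct.unpack("<QQ", f.read(16))
            assert 0 <= i < count
            f.seek(index_pos + 8 * i)
            off, = struct.unpack("<Q", f.read(8))
            f.seek(off)
            size, = struct.unpack("<Q", f.read(8))
            return zlib.decompress(f.read(size))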

  2. #2
    Member
    Join Date
    Jun 2018
    Location
    Yugoslavia
    Posts
    62
    Thanks
    8
    Thanked 5 Times in 5 Posts
    Black-and-white images? JBIG2 has the highest compression, I think, but it can't do solid compression.
    In 7z, for example, you can set a solid block size limit that speeds up decompressing a single file.

  3. #3
    Member
    Join Date
    May 2008
    Location
    Kuwait
    Posts
    354
    Thanks
    37
    Thanked 38 Times in 23 Posts
    I don't think JBIG supports multi-page, but DjVu has a similar encoder, called JB2, that does support multi-page documents, and its dictionary can be shared across pages (you can set this yourself). Also, if I remember well, the output can be saved as a multi-page file of pages plus a dictionary file.

  4. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,977
    Thanks
    296
    Thanked 1,304 Times in 740 Posts
    One line in such an image would take 125 MB.
    The total size of the uncompressed image would be 125 PB - it's unlikely that you have enough storage.
    There's no existing image format that supports solid compression at that kind of resolution.
    And non-solid compression is independent compression of image blocks - the overall resolution doesn't matter in that case.

    Depending on the properties of the actual image data, it may be a good idea to maintain a hierarchy of downscaled images
    (maybe grayscale instead of 1bpp), something like a mipmap.
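    A minimal sketch of the downscaled-hierarchy idea, assuming a frame (or a tile of one, since a full 10e9 * 10e9 matrix will not fit in memory) is available as a NumPy 0-1 array; each level averages 2x2 blocks, so the 1bpp data becomes grayscale densities, and coarse levels can be compared cheaply before touching full resolution. The helper name and parameters are arbitrary.

    import numpy as np

    def build_pyramid(bitmap, min_size=64):
        """Build a mipmap-like hierarchy: each level halves the resolution by
        averaging 2x2 blocks, turning the 1bpp matrix into grayscale densities."""
        levels = [bitmap.astype(np.float32)]
        cur = levels[0]
        while min(cur.shape) > min_size:
            h, w = cur.shape
            h -= h % 2                        # crop odd dimensions
            w -= w % 2
            cur = cur[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
            levels.append(cur)
        return levels

    # usage: compare two same-sized frames cheaply on the coarsest level first
    # a, b = build_pyramid(frame_a), build_pyramid(frame_b)
    # coarse_distance = np.abs(a[-1] - b[-1]).mean()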

  5. #5
    Member
    Join Date
    Aug 2020
    Location
    Italy
    Posts
    5
    Thanks
    1
    Thanked 0 Times in 0 Posts
    I suppose I was a bit cryptic in my explanation; let me try to explain myself better.
    I used the term bitmap assuming it is a synonym of bit-matrix (a matrix where each row/column is a 0-1 string).
    So I have this large collection of large square bit-matrices.
    By the structure of my original problem I know that the i-th row may be similar to the i-th row of the previous and next elements of the collection.
    (In my head I'm thinking of this collection as a video, so we have frames close in time that are similar and therefore redundant.)
    For now I'm trying to approximately cluster similar frames together (doing it exactly should be NP-hard), so that I can deploy some kind of LZ compression algorithm exploiting long runs of very similar frames.
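    A rough sketch of one possible approximate-clustering step, assuming each frame (or tile) is a NumPy 0-1 array: sample a fixed set of positions as a cheap fingerprint, then greedily order frames by fingerprint Hamming distance so that an LZ-style compressor sees long runs of near-identical frames. The sampling fingerprint and the greedy ordering are stand-ins for a real clustering method, not something prescribed here.

    import numpy as np

    def fingerprint(frame, k=4096, seed=0):
        """Cheap fixed-length signature: sample k positions of the bit-matrix.
        Similar frames should give signatures with a small Hamming distance."""
        rng = np.random.default_rng(seed)          # same seed for every frame
        rows = rng.integers(0, frame.shape[0], k)
        cols = rng.integers(0, frame.shape[1], k)
        return frame[rows, cols].astype(np.uint8)

    def greedy_order(fingerprints):
        """Greedy nearest-neighbour ordering: start from frame 0 and repeatedly
        append the unvisited frame whose fingerprint is closest in Hamming
        distance, producing long runs of near-identical frames."""
        n = len(fingerprints)
        unused = set(range(1, n))
        order = [0]
        while unused:
            last = fingerprints[order[-1]]
            nxt = min(unused,
                      key=lambda j: int(np.count_nonzero(fingerprints[j] != last)))
            order.append(nxt)
            unused.remove(nxt)
        return order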

  6. #6
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,977
    Thanks
    296
    Thanked 1,304 Times in 740 Posts
    Well, you can use CDC for fast detection of long matches: https://www.usenix.org/sites/default...slides_xia.pdf
    Or compare downscaled versions as I suggested before.
    I guess you can also try putting rows into video frames and using video codecs for compression - it should be better than LZ at least.
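    A toy sketch of the CDC idea (gear-hash boundaries plus deduplication by hash), not the exact scheme from the linked slides; it assumes the rows/frames are serialized to bytes first, and the mask and size limits are arbitrary. Identical content yields identical chunks even when it is shifted within the stream, which is what makes the long-match detection fast.

    import hashlib
    import random

    # 256 random 64-bit values for the gear hash (fixed seed keeps boundaries stable)
    random.seed(1)
    GEAR = [random.getrandbits(64) for _ in range(256)]

    def cdc_chunks(data, mask=(1 << 13) - 1, min_size=2048, max_size=65536):
        """Content-defined chunking: cut whenever the rolling gear hash matches
        the mask, so chunk boundaries depend on content, not on position."""
        chunks, start, h = [], 0, 0
        for i, b in enumerate(data):
            h = ((h << 1) + GEAR[b]) & 0xFFFFFFFFFFFFFFFF
            size = i - start + 1
            if (size >= min_size and (h & mask) == 0) or size >= max_size:
                chunks.append(data[start:i + 1])
                start, h = i + 1, 0
        if start < len(data):
            chunks.append(data[start:])
        return chunks

    def dedup(chunks):
        """Keep each distinct chunk once; repeated rows/frames become references."""
        store, refs = {}, []
        for c in chunks:
            key = hashlib.sha256(c).hexdigest()
            store.setdefault(key, c)
            refs.append(key)
        return store, refs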

  7. Thanks:

    compressed_doggo (7th September 2020)

  8. #7
    Member
    Join Date
    Apr 2009
    Location
    The Netherlands
    Posts
    83
    Thanks
    6
    Thanked 18 Times in 11 Posts
    Is it possible to publish an example file? If it looks interesting, I would rewrite an old piece of code for this specific problem.

  9. #8
    Member
    Join Date
    Aug 2020
    Location
    Italy
    Posts
    5
    Thanks
    1
    Thanked 0 Times in 0 Posts
    I do not have a file in that specific format. If you need it, I could write a little script to save that kind of file. Sorry about that, but I'm still designing the solution in Python, so efficiency is not quite there yet.
    By the way, so far I've been starting from files containing one triple (u, v, t) per line.
    When you read a triple (u, v, t) you just perform an "assignment" like:
    collection[u][v][t] = collection[v][u][t] = 1

    If you want I can send you a file in that format. Let me know, and thanks for the interest.
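    A minimal sketch of that loading step, assuming the triples are whitespace-separated integers, one per line (adjust the parsing to the real format): it builds, for each time t, the set of undirected edges seen at that time, i.e. a sparse view of the t-th bit-matrix.

    from collections import defaultdict

    def load_triples(path):
        """Read lines of 'u v t' and build, for each time t, the set of
        undirected edges present at that time."""
        frames = defaultdict(set)
        with open(path) as f:
            for line in f:
                u, v, t = map(int, line.split())
                frames[t].add((min(u, v), max(u, v)))   # symmetric: store one orientation
        return frames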

  10. #9
    Member
    Join Date
    Apr 2009
    Location
    The Netherlands
    Posts
    83
    Thanks
    6
    Thanked 18 Times in 11 Posts
    Quote Originally Posted by compressed_doggo View Post
    I do not have a file in that specific format. If you need it, I could write a little script to save that kind of file. Sorry about that, but I'm still designing the solution in Python, so efficiency is not quite there yet.
    By the way, so far I've been starting from files containing one triple (u, v, t) per line.
    When you read a triple (u, v, t) you just perform an "assignment" like:
    collection[u][v][t] = collection[v][u][t] = 1

    If you want I can send you a file in that format. Let me know, and thanks for the interest.
    I am still interested. Please send me a file.

  11. #10
    Member
    Join Date
    May 2020
    Location
    Berlin
    Posts
    64
    Thanks
    10
    Thanked 20 Times in 15 Posts
    Is there anything you could compute based on parts of the data? This could be implemented in zpaq.

  12. #11
    Member
    Join Date
    Aug 2020
    Location
    Italy
    Posts
    5
    Thanks
    1
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Dresdenboy View Post
    Is there anything you could compute based on parts of the data? This could be implemented in zpaq.
    Each binary matrix (can I call it a bitmap, or is that confusing?) is the adjacency matrix of a graph.
    I have a collection of these binary matrices because my main goal is to find a way to compress a dynamic graph (i.e. a graph that changes over time). Each matrix is the representation of the graph at some time, and I can assume w.l.o.g. that matrices "close" in time are similar.
    By the way, to work on parts of the data independently I should first try to find something like (strongly?) connected components.
    Moreover, representing graphs as matrices may be convenient only when the graph is not too sparse.
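    A small sketch of one way to exploit the "matrices close in time are similar" assumption, assuming each frame is represented as a set of (u, v) edges: store the first frame in full and then only the symmetric difference between consecutive frames. This is just an illustrative delta encoding, not a method anyone in the thread committed to.

    def edge_deltas(frames):
        """Store the first frame's edge set in full, then only the symmetric
        difference (edges added or removed) between consecutive frames.
        Similar frames produce tiny deltas."""
        times = sorted(frames)
        deltas = {times[0]: set(frames[times[0]])}
        for prev, cur in zip(times, times[1:]):
            deltas[cur] = frames[cur] ^ frames[prev]   # set symmetric difference
        return deltas

    def reconstruct(deltas, t):
        """Rebuild the edge set at time t by replaying deltas from the start."""
        times = sorted(deltas)
        edges = set(deltas[times[0]])
        for tt in times[1:]:
            if tt > t:
                break
            edges ^= deltas[tt]
        return edges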

  13. #12
    Member
    Join Date
    Apr 2009
    Location
    The Netherlands
    Posts
    83
    Thanks
    6
    Thanked 18 Times in 11 Posts
    Thank you for the file, compressed_doggo. It is a 5 MB text file with triples of (around 17-bit) unsigned integers. To be more precise: the first two values have a complexity of about 17 bits each. The last value is clearly a year between 1900 and now, which has around 7 bits of complexity.
    Converting them to binary integers brings each row down to around 6 bytes. This is before compression, and it keeps the structure of the file predictable while already being significantly smaller than the old one.

    If you want to be able to extract specific rows from the file, I would advise building a static Huffman tree optimized for the data. If you just want to do things at the file level, I would pick your favourite archive program with a compression/speed ratio that satisfies you.
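    A small sketch of the fixed-width packing described above, taking the quoted figures as assumptions (two 17-bit ids plus a year offset of at most 7 bits, packed into 41 bits and stored in 6 bytes); the exact bit layout is just one possible choice.

    def pack_triple(u, v, year, base_year=1900):
        """Pack u (17 bits), v (17 bits) and the year offset (7 bits) into 41 bits,
        stored in 6 bytes - roughly the per-row size mentioned above."""
        assert u < (1 << 17) and v < (1 << 17) and 0 <= year - base_year < (1 << 7)
        word = (u << 24) | (v << 7) | (year - base_year)
        return word.to_bytes(6, "little")

    def unpack_triple(blob, base_year=1900):
        """Inverse of pack_triple."""
        word = int.from_bytes(blob, "little")
        return ((word >> 24) & 0x1FFFF,
                (word >> 7) & 0x1FFFF,
                (word & 0x7F) + base_year)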

  14. #13
    Member
    Join Date
    Aug 2020
    Location
    Italy
    Posts
    5
    Thanks
    1
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Kaw View Post
    Thank you for the file, compressed_doggo. It is a 5 MB text file with triples of (around 17-bit) unsigned integers. To be more precise: the first two values have a complexity of about 17 bits each. The last value is clearly a year between 1900 and now, which has around 7 bits of complexity.
    Converting them to binary integers brings each row down to around 6 bytes. This is before compression, and it keeps the structure of the file predictable while already being significantly smaller than the old one.

    If you want to be able to extract specific rows from the file, I would advise building a static Huffman tree optimized for the data. If you just want to do things at the file level, I would pick your favourite archive program with a compression/speed ratio that satisfies you.

    Firstly, thanks a lot for the interest. I'm pretty sure I'm not interested in doing things at the file level.
    I'll have a look at static Huffman trees; I don't recall all the details.
    My only doubt is that by sticking to the "triples representation" I am not exploiting the temporality assumption on the data.

    Instead, what if I look at these triples as points in 3D space? Do you know of any (compressed) data structure, working like an index, that lets us retrieve points in O(1)?

