
Thread: Data segmentation for effective compression?

  1. #1
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Hi,
    I have some ideas for my BIT compressor. I've noticed that some files need segmentation for effective compression (like BSP files in games). Of course, this is not a new idea; I first saw it in WinRK. I believe that by using segmentation we can get much better compression of disk images (like ISO) and compound document files (like PDF, DOC, etc.). Also, we can use a different compression algorithm for each piece of the file, e.g. JPEG repacking for embedded JPEG files; wavelet, S+P, or similar algorithms for WAV and other uncompressed audio; BWT for the text parts; ROLZ for the binary parts. Do you have any practical ideas?

    I have only one idea about segmentation so far. We can roughly break the file into printable and non-printable pieces using a threshold (a sketch follows below). My idea is not just about breaking the file into pieces, but also about sticking related pieces together. Imagine we have several printable-character blocks in the file. If we concatenate these blocks and then compress the resulting large block with a BWT-like algorithm, I believe we can reach an even better result.
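    A minimal sketch of the thresholding idea (my own illustration, not any real archiver's code): classify fixed-size chunks by their fraction of printable bytes, then merge adjacent chunks of the same class. The chunk size and threshold are arbitrary demonstration values.
    Code:
    #include <cctype>
    #include <cstddef>
    #include <vector>
    
    // One detected segment: [begin, end) plus its class.
    struct Segment { std::size_t begin, end; bool text; };
    
    // Classify fixed-size chunks as "text" or "binary" by the fraction of
    // printable bytes, then merge adjacent chunks of the same class.
    std::vector<Segment> segment(const unsigned char* data, std::size_t size) {
        const std::size_t CHUNK = 4096;   // illustrative block size
        const double THRESHOLD = 0.9;     // >=90% printable => text chunk
        std::vector<Segment> segs;
        for (std::size_t pos = 0; pos < size; pos += CHUNK) {
            std::size_t end = (pos + CHUNK < size) ? pos + CHUNK : size;
            std::size_t printable = 0;
            for (std::size_t i = pos; i < end; ++i) {
                unsigned char c = data[i];
                if (std::isprint(c) || c == '\n' || c == '\r' || c == '\t')
                    ++printable;
            }
            bool text = printable >= THRESHOLD * (end - pos);
            if (!segs.empty() && segs.back().text == text)
                segs.back().end = end;    // merge with previous segment
            else
                segs.push_back({pos, end, text});
        }
        return segs;
    }
    The "sticking" step would then write all text segments into one stream and all binary segments into another, keeping a small index of the original order so the file can be reassembled after decompression.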

    Best Regards
    BIT Archiver homepage: www.osmanturan.com

  2. #2
    Member
    Join Date
    Oct 2007
    Location
    Germany, Hamburg
    Posts
    408
    Thanks
    0
    Thanked 5 Times in 5 Posts
    Yes, that should be an important part of a good archiver, and I think Bulat will do something like this in the future. He does some sort of content analysis, but I don't think it's more than one content type per file.
    What you describe is, in my eyes, the strategy of Precomp. It extracts low-level archives, compresses JPGs, and could do many other jobs later. PAQ also compresses embedded JPEG streams (at least in some files).

  3. #3
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    I agree. Bulat has made some great filters and preprocessors. Good job, Bulat! But I've never heard of an algorithm like this: "break the file into pieces, then stick similar pieces together, and finally compress each large block with a specialized algorithm". I'm not sure whether this is a new idea or not. Maybe someone can teach me some compression history.

    I'm aware of PAQ's brilliant embedded JPEG processing. This helps compression of compound documents like DOC, PDF, etc. But I'm talking about going further than that. Then again, maybe my idea is useless in practice.
    BIT Archiver homepage: www.osmanturan.com

  4. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,375
    Thanks
    214
    Thanked 1,023 Times in 544 Posts
    > But I've never heard of an algorithm like this: "break the file into
    > pieces, then stick similar pieces together, and finally compress each
    > large block with a specialized algorithm". I'm not sure whether this is
    > a new idea or not. Maybe someone can teach me some compression history.

    Shkarin's durilca does something like that... or at least it can,
    in the form of segmentation + solid compression.

    Durilca is also a good example of the general preprocessing approach,
    where one doesn't implement custom models for all supported data formats
    (which is the most effective option, but too time-consuming), but instead
    transforms the data into a form "understandable" to a universal compressor.
    E.g. an array of 32-bit ints becomes much more compressible by a bit- or
    byte-oriented compressor if you first apply some deltas and
    variable-length byte encoding.
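    For illustration, a minimal sketch of that transform, assuming an array of unsigned 32-bit values; the zigzag step and the 7-bit varint layout are my choices here, not necessarily durilca's:
    Code:
    #include <cstdint>
    #include <vector>
    
    // Delta + variable-length byte encoding for an array of 32-bit ints.
    // Small deltas shrink to 1-2 bytes, which a byte-oriented compressor
    // models far better than raw 4-byte words.
    std::vector<uint8_t> delta_varint(const std::vector<uint32_t>& v) {
        std::vector<uint8_t> out;
        uint32_t prev = 0;
        for (uint32_t x : v) {
            int32_t d = (int32_t)(x - prev);             // delta
            prev = x;
            // zigzag: move the sign into the LSB so small negatives stay small
            // (assumes arithmetic right shift for signed ints)
            uint32_t z = ((uint32_t)d << 1) ^ (uint32_t)(d >> 31);
            while (z >= 0x80) {                          // 7 bits per byte,
                out.push_back((uint8_t)(z | 0x80));      // MSB = "more follows"
                z >>= 7;
            }
            out.push_back((uint8_t)z);
        }
        return out;
    }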

  5. #5
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Thanks, Shelwien. After digging on Google, I've found Shkarin's seg_file (http://www.compression.ru/ds/seg_file.rar). I believe this preprocessor is a simplified version of durilca, because it failed on my main test file (valley.cmb: 19,776,230 bytes) and threw an exception: "Error: too big file!". I've tested it on the Calgary corpus and on q3dm11.bsp (the biggest map file in Quake 3). I got interesting results:
    Code:
    BIT 0.2 (x64 version) 
    - Calgary Corpus (17 segments) 
      852,099 bytes (61.246 seconds) 
     
    - Calgary Corpus (Direct Processing) 
      837,781 bytes (147.046 seconds) 
     
    - Q3DM11.BSP (41 segments) 
      1,807,597 bytes (386.351 seconds) 
     
    - Q3DM11.BSP (Direct Processing) 
      1,778,338 bytes (767.992 seconds) 
     
     
    PAQ9a -7 
    - Calgary Corpus 
      676,856 bytes (10.220 seconds) 
     
    - Q3DM11.BSP 
      1,473,670 bytes (19.360 seconds)
    Regarding the timing column: my current implementation suffers from a linear search for finding the best ROLZ index (there are 32768 match indexes!), so it is even slower than PAQ9a. For now, segmentation only reduces compression time, not the compressed size, which by itself is not that useful, because once I get rid of the linear search the compression will be much faster anyway.
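    To illustrate the bottleneck (a sketch of the general technique, not BIT's actual code): a ROLZ coder keeps a table of recent positions per context, and the naive way to pick the best match index is to scan every filled slot.
    Code:
    #include <cstdint>
    #include <cstddef>
    
    const int SLOTS = 32768;              // match indexes per context
    static uint32_t table[256][SLOTS];    // recent positions, 1-byte context
    static int filled[256];               // used slots per context
    
    // Naive best-match search: an O(SLOTS) scan per coded position. This is
    // exactly the linear search mentioned above; hashing the next few bytes
    // or capping the slot count removes it. Requires pos >= 1.
    int find_best(const uint8_t* buf, std::size_t pos, std::size_t size,
                  int* best_len) {
        int ctx = buf[pos - 1];
        int best = -1;
        *best_len = 0;
        for (int i = 0; i < filled[ctx]; ++i) {
            std::size_t cand = table[ctx][i];
            int len = 0;
            while (pos + len < size && buf[cand + len] == buf[pos + len])
                ++len;
            if (len > *best_len) { *best_len = len; best = i; }
        }
        return best;                      // match index, or -1 for a literal
    }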

    I believe most people on this forum should test their compressor on at least one game map (especially a BSP map). Why? Because a well-used BSP file contains several data types at the same time. For example, a regular Quake 3 BSP map file contains things like the following (a sketch of the file's lump directory comes after the list):

    - A huge string holding the map entity declarations
    - A huge string broken into fixed-size (64-character) strings (holding material paths)
    - 32-bit integers for various indexing
    - 3D floating-point vectors
    - A huge bitstream encoding visibility for each map area (actually called a cluster)
    - A huge 3D grid holding quantized 3D vectors for local dynamic lighting
    - A huge chunk holding several 256x256 raw RGB images (lightmaps)
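    The format makes segmentation almost mechanical, because the header is already a segment map. A sketch of the header, based on the publicly documented Quake 3 BSP layout (the lump numbering follows the common unofficial spec):
    Code:
    #include <cstdint>
    
    // Quake 3 BSP header: magic "IBSP", version 46, then a directory of
    // 17 lumps, each an (offset, length) pair into the file. Cutting the
    // file at these boundaries yields the segments listed above, e.g.
    // lump 0 = entity string, 1 = material paths, 14 = lightmaps,
    // 15 = light grid, 16 = visibility bitstream.
    struct BspLump   { int32_t offset, length; };
    struct BspHeader {
        char    magic[4];   // "IBSP"
        int32_t version;    // 46 for Quake 3
        BspLump lumps[17];  // the built-in segment map
    };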

    I noticed that Shkarin's seg_file broke the Quake 3 BSP file into pieces almost perfectly, as if hand-optimized

    I think Shkarin's seg_file is more useful for fast LZ compressors with a small dictionary, and maybe for PPM-class compressors. It seems less useful for large-dictionary LZ-based compressors (like mine, which has a 16 MB dictionary)
    BIT Archiver homepage: www.osmanturan.com

  6. #6
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,375
    Thanks
    214
    Thanked 1,023 Times in 544 Posts
    > Thanks, Shelwien. After digging on Google, I've found Shkarin's seg_file
    > (http://www.compression.ru/ds/seg_file.rar). I believe this preprocessor
    > is a simplified version of durilca.

    I wouldn't put it like that; basically, durilca is seg_file+disasm32+ppmonstr.
    So it's not simplified or anything.

    > Because it failed on my main test file (valley.cmb: 19,776,230 bytes)
    > and threw an exception: "Error: too big file!".

    Well, I uploaded an older durilca version which is better suited for such tests.
    http://shelwien.googlepages.com/durilca2_002a.rar
    It's the last version which still had the hidden "-l" switch for dumping
    segmentation results. (Usage: durilca e -t1 -l ccc.tar)

    > I noticed that Shkarin's seg_file broke the Quake 3 BSP file into pieces
    > almost perfectly, as if hand-optimized

    Better look at what it does to executables; it's much more interesting

    > I think Shkarin's seg_file is more useful for fast LZ compressors with a
    > small dictionary, and maybe for PPM-class compressors. It seems less useful
    > for large-dictionary LZ-based compressors (like mine, which has a 16 MB
    > dictionary)

    Well, it's only not useful when the compression is really poor.

  7. #7
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    > I wouldn't put it like that; basically, durilca is
    > seg_file+disasm32+ppmonstr.
    > So it's not simplified or anything.

    Well, there was a misunderstanding: I was only talking about durilca's preprocessor part. I mean that seg_file seems to be a simplified version of durilca's preprocessor. So we are talking about exactly the same thing

    > Well, I uploaded an older durilca version which is better suited for
    > such tests.
    > http://shelwien.googlepages.com/durilca2_002a.rar
    > It's the last version which still had the hidden "-l" switch for
    > dumping segmentation results. (Usage: durilca e -t1 -l ccc.tar)

    I think Shkarin must not like you, because most of his work becomes public through you. Thanks a lot anyway. You and Shkarin are good guys for the compression world, and thanks a lot for your work.

    Let's see the results with durilca's preprocessor + my ugly ROLZ compressor:
    Code:
    BIT 0.2 (x64 version) 
    - Calgary Corpus (16 segments) 
      851,943 bytes (61.216 seconds) 
     
    - Q3DM11.BSP (12 segments) 
      1,786,701 bytes (452.277 seconds) 
     
    - Valley.cmb (3 segments) 
      9,277,375 bytes (1408.269 seconds) 
     
    - Valley.cmb (Direct Processing) 
      9,272,426 bytes (I was really bored while compressing at 12 KB/s :D)

    I've noticed that I didn't give you information about the test file sizes. Here they are:
    Code:
    Calgary Corpus: 3,152,896 bytes 
    q3dm11.bsp: 8,490,236 bytes 
    Valley.cmb: 19,776,230 bytes

    Given these results, I believe generic segmentation doesn't help my compressor. It seems I must write specific models for specific file types

    I have some ideas for my new compression scheme. I'll build these compressors:

    - ROLZ+CM: my current implementation is only a draft of it. I'm polishing it. It will be used for generic data.
    - BWT+State Map+ARI: Matt's work seems great for text compression. Thanks again.
    - S+P transform with VLI on a semi-programmable structure: I believe this will rock most benchmarks! One simple codec for images and audio files, even for 15-bit image formats! (A sketch of the S transform follows below.)
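    For the last item, a minimal sketch of the reversible S (sequential) step that S+P is built on, operating on one integer pair; the "P" prediction stage, which refines the differences before coding, is omitted here:
    Code:
    #include <cstdint>
    
    // One step of the 1-D S transform: a reversible integer
    // average/difference pair (assumes arithmetic right shift).
    void s_forward(int32_t a, int32_t b, int32_t* s, int32_t* d) {
        *s = (a + b) >> 1;   // s = floor((a+b)/2)
        *d = a - b;          // d = a - b
    }
    
    void s_inverse(int32_t s, int32_t d, int32_t* a, int32_t* b) {
        *b = s - (d >> 1);   // b = s - floor(d/2)
        *a = *b + d;         // a = b + d
    }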
    BIT Archiver homepage: www.osmanturan.com

