
Thread: Need advice on compressing specific sample of data

  1. #1
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,474
    Thanks
    26
    Thanked 121 Times in 95 Posts

    Need advice on compressing specific sample of data

    I have a homework assignment for an Image Compression course. The task is to develop an algorithm that compresses a sample of quantized signals (I guess), so that the sum of the Java or Python source code size and the compressed file size is smallest.

    Provided sample for testing is 100 000 bytes: http://www65.zippyshare.com/v/35605610/file.html
    The complete file is 10 times as big. It has to compress in under one minute.

    If someone is bored, they could give me some hints.

    My current log:
    Code:
    -rw-r--r-- 1 piotrek piotrek  43025 2011-04-20 21:08 z10.pmd
    -rw-r--r-- 1 piotrek piotrek  40727 2011-04-20 21:08 z10.pmm
    -rw-r--r-- 1 piotrek piotrek 100000 2011-04-20 20:52 z10.txt
    -rw-r--r-- 1 piotrek piotrek  46465 2011-04-20 20:55 z10.txt.bz2
    -rw-r--r-- 1 piotrek piotrek  50292 2011-04-20 21:17 z10.txt.fpaq0f
    -rw-r--r-- 1 piotrek piotrek  50431 2011-04-20 21:17 z10.txt.fpaq0f2
    -rw-r--r-- 1 piotrek piotrek  51601 2011-04-20 21:14 z10.txt.fpaq1
    -rw-r--r-- 1 piotrek piotrek  47852 2011-04-20 20:56 z10.txt.gz
    -rw-r--r-- 1 piotrek piotrek  48328 2011-04-20 21:21 z10.txt.jtlzp
    -rw-r--r-- 1 piotrek piotrek  41063 2011-04-20 20:56 z10.txt.lzma
    -rw-r--r-- 1 piotrek piotrek  54851 2011-04-20 21:47 z10.txt.lzw
    -rw-r--r-- 1 piotrek piotrek  38908 2011-04-20 21:11 z10.txt.paq8l
    -rw-r--r-- 1 piotrek piotrek  48026 2011-04-20 21:19 z10.txt.tlzp
    tlzp is an output from TarsaLZP.

  2. #2
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    863
    Thanks
    459
    Thanked 257 Times in 105 Posts
    The sample output is strange. It looks like the output of a weak STn applied to a text file.

  3. #3
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,334
    Thanks
    209
    Thanked 1,008 Times in 533 Posts
    I tried visualizing it, and it somehow looks like an RLE-compressed picture.
    Also, there are only 19 symbols (a-s), so the packed uncompressed size would be ~53100
    ( http://www.wolframalpha.com/input/?i=Log%5B256%2C19^100000%5D )
    Taking the source size into account, it might not be so bad, actually.
    With 64-bit arithmetic it seems possible to pack it into 53102 bytes or so
    (http://www.wolframalpha.com/input/?i=Log%5B256%2C19^15%5D*Ceil%5B100000%2F15%5D)
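As a sanity check on those numbers, here is a minimal Python sketch of the 64-bit packing idea (my illustration, not anyone's actual entry). Since 19^15 < 2^64, every 15 symbols fit in one 8-byte word, which gives ceil(100000/15) * 8 = 53336 bytes — slightly above the fractional ~53102 bound, which would need carrying bits across groups. It assumes the symbols are already mapped to integers 0..18.

```python
import math

def pack_base19(symbols, base=19, group=15):
    """Pack each group of 15 base-19 digits into one 64-bit word.
    19**15 < 2**64, so a group always fits in 8 bytes."""
    out = bytearray()
    for i in range(0, len(symbols), group):
        chunk = list(symbols[i:i + group])
        chunk += [0] * (group - len(chunk))  # pad the final short group
        v = 0
        for s in chunk:
            v = v * base + s  # most-significant digit first
        out += v.to_bytes(8, 'little')
    return bytes(out)

def unpack_base19(packed, length, base=19, group=15):
    """Inverse of pack_base19: recover `length` symbols."""
    symbols = []
    for i in range(0, len(packed), 8):
        v = int.from_bytes(packed[i:i + 8], 'little')
        digits = []
        for _ in range(group):
            digits.append(v % base)  # least-significant digit first
            v //= base
        symbols.extend(reversed(digits))
    return symbols[:length]
```

For 100,000 symbols this produces exactly 53,336 bytes; an arithmetic coder with fixed equal probabilities would get the fractional bound.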

  4. #4
    Member Alexander Rhatushnyak's Avatar
    Join Date
    Oct 2007
    Location
    Canada
    Posts
    237
    Thanks
    39
    Thanked 92 Times in 48 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    The task is to develop algorithm that compresses some sample of quantized signals (I guess), so that sum of Java or Python source code size + compressed file size is smallest.
    No restrictions on memory usage and speed?
    In this case paq8* will win, I guess. Or a simplified and tuned paq8*, if you have competitors and/or time for improvement.

    Quote Originally Posted by Piotr Tarsa View Post
    My current log:
    Code:
    -rw-r--r-- 1 piotrek piotrek  43025 2011-04-20 21:08 z10.pmd
    -rw-r--r-- 1 piotrek piotrek  40727 2011-04-20 21:08 z10.pmm
    -rw-r--r-- 1 piotrek piotrek 100000 2011-04-20 20:52 z10.txt
    -rw-r--r-- 1 piotrek piotrek  46465 2011-04-20 20:55 z10.txt.bz2
    -rw-r--r-- 1 piotrek piotrek  50292 2011-04-20 21:17 z10.txt.fpaq0f
    -rw-r--r-- 1 piotrek piotrek  50431 2011-04-20 21:17 z10.txt.fpaq0f2
    -rw-r--r-- 1 piotrek piotrek  51601 2011-04-20 21:14 z10.txt.fpaq1
    -rw-r--r-- 1 piotrek piotrek  47852 2011-04-20 20:56 z10.txt.gz
    -rw-r--r-- 1 piotrek piotrek  48328 2011-04-20 21:21 z10.txt.jtlzp
    -rw-r--r-- 1 piotrek piotrek  41063 2011-04-20 20:56 z10.txt.lzma
    -rw-r--r-- 1 piotrek piotrek  54851 2011-04-20 21:47 z10.txt.lzw
    -rw-r--r-- 1 piotrek piotrek  38908 2011-04-20 21:11 z10.txt.paq8l
    -rw-r--r-- 1 piotrek piotrek  48026 2011-04-20 21:19 z10.txt.tlzp
    40900 bytes -- lpaq8 6
    lpaq8 sources can be compressed to less than 9000 bytes if necessary

    This LinkedIn group is dedicated to image compression:
    http://linkedin.com/groups/Image-Compression-3363256

  5. #5
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    561
    Thanks
    212
    Thanked 200 Times in 93 Posts
    Look at the attached image; it shows ~200 lines of the data at width 61 (contrast and brightness changed a bit to make things clearer; the blue border is not part of the data). It shows some clear patterns, and a delta filter/prediction would definitely help here.

    Unfortunately, other regions of the image have other widths, and some parts don't show any patterns at all.

    So perhaps it would help to identify such regions, perhaps reorder them, and apply delta filters or similar things.
    Attached: z10_61_.png
    Last edited by schnaader; 21st April 2011 at 03:47.
    http://schnaader.info
    Damn kids. They're all alike.

  6. #6
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    561
    Thanks
    212
    Thanked 200 Times in 93 Posts
    Made a small C++ program that checks match count for certain widths on blocks of data - see attachment.

    A "match" occurs if byte at position X is the same as the byte at position X+w where w is the width.

    Output for a block size of 1000 bytes, width = 4..250, at least 33% matches:

    Code:
    Best width for bytes 5000..6000: 61 (44.10% matched)
    Best width for bytes 6000..7000: 61 (66.20% matched)
    Best width for bytes 7000..8000: 61 (57.20% matched)
    Best width for bytes 8000..9000: 61 (69.70% matched)
    Best width for bytes 9000..10000: 61 (59.10% matched)
    Best width for bytes 10000..11000: 61 (66.70% matched)
    Best width for bytes 11000..12000: 61 (41.40% matched)
    Best width for bytes 15000..16000: 61 (67.50% matched)
    Best width for bytes 16000..17000: 61 (71.40% matched)
    Best width for bytes 17000..18000: 61 (60.20% matched)
    Best width for bytes 18000..19000: 61 (53.30% matched)
    Best width for bytes 25000..26000: 55 (71.00% matched)
    Best width for bytes 26000..27000: 49 (78.30% matched)
    Best width for bytes 34000..35000: 49 (46.50% matched)
    Best width for bytes 47000..48000: 9 (35.30% matched)
    Best width for bytes 48000..49000: 18 (36.20% matched)
    Best width for bytes 55000..56000: 55 (72.50% matched)
    Best width for bytes 69000..70000: 9 (43.00% matched)
    I tested with some other settings and the results are quite consistent. About 20% of the data has more than 33% matches.
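The attached program is C++; purely for illustration, the same heuristic can be sketched in a few lines of Python (my re-implementation of the stated rule, not the attached source):

```python
def best_width(block, min_w=4, max_w=250, threshold=0.33):
    """Return (width, match_ratio) for the best width in a block, or None
    if no width reaches the threshold. A 'match' means block[x] == block[x + w]."""
    best = None
    for w in range(min_w, max_w + 1):
        n = len(block) - w
        if n <= 0:
            continue
        matches = sum(block[x] == block[x + w] for x in range(n))
        ratio = matches / n
        if ratio >= threshold and (best is None or ratio > best[1]):
            best = (w, ratio)
    return best
```

Running this over consecutive 1000-byte blocks reproduces the kind of table above: blocks with a dominant width get a high ratio, unstructured blocks return None.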
    Attached Files
    Last edited by schnaader; 21st April 2011 at 03:43.
    http://schnaader.info
    Damn kids. They're all alike.

  7. #7
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Attached is output of fv (converted to png) which is described in my book ( http://mattmahoney.net/dc/dce.html#Section_2 ). It shows location of string matches in the file. Horizontal axis is relative location in the file. Vertical axis (log scale) is distance to the previous match. Color indicates length of the match: black=1, red=2, green=4, blue=8. For example, the horizontal bars at multiples of 61 in some places show a repeating structure. paq8* recognizes repeating structures like this so it gets good compression. Also, along the top of the image you can see that some matches span the entire file. The lack of blue (except one small area) shows a lack of long matches, so using higher order contexts won't help much. It also looks like bits of structured data mixed with random data.

    The second image shows byte distribution. Vertical axis is byte 0 at the bottom and 255 at the top.

    Tools like these are useful for developing new compression algorithms. However I don't see anything here that would help me do any better than existing methods like paq8. I would have to know how the data was created.
    Attached: fv.png, vdist.png

  8. #8
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,474
    Thanks
    26
    Thanked 121 Times in 95 Posts
    Thank you all for your feedback.

    Cyan:
    That's very unlikely. The teacher probably doesn't know what STn or BWT is. In fact, I'll be giving a presentation on it soon.

    Shelwien:
    Yes, there's a limited alphabet. It would be easiest to just assign fixed, equal probabilities to all symbols, but compression would suffer.

    Alexander:
    It has to compress the full sample of data (a million bytes) in under one minute. Memory limits are probably very high, but anyway, you don't need many resources to compress a million characters. The source code won't be compressed, but whitespace will not be counted. And the source code must be in either Java or Python (or Jython).

    Schnaader:
    Thanks for your analysis and your program. I think it's simple enough, and it should improve compression enough that it's definitely worth implementing (in Java, probably, as Python is awfully slow).

    Matt:
    I don't know exactly how the data was created. The teacher said that it "results from quantization of some real process". I think that could mean anything.

    The teacher also said that there were too many symbols in the file. He first quantized the process, which resulted in a few symbols (fewer than five, IIRC), then tried twice to somehow increase the number of symbols. He said that this process introduced additional randomness (or something like that, I don't remember exactly). He also extended the deadline for the homework/contest to 29.06.2011 and created another very similar contest. Maybe when the deadline passes he will tell us how he created the data.

    The new contest is to compress another 1,000,000 bytes of data (although the program will also be tested on 2,000,000- and 5,000,000-byte samples). It contains fewer symbols and is said to have better structure. The new file is here: http://www65.zippyshare.com/v/80350721/file.html (i.e. only 10% of the data is released to the public).

    New logs from compression using usual programs:
    Code:
    -rw-r--r-- 1 piotrek piotrek  30962 2011-04-21 10:58 X10.pmd
    -rw-r--r-- 1 piotrek piotrek  29215 2011-04-21 10:23 X10.pmm
    -rw-r--r-- 1 piotrek piotrek 100000 2011-04-21 10:21 X10.txt
    -rw-r--r-- 1 piotrek piotrek  34117 2011-04-21 10:22 X10.txt.bz2
    -rw-r--r-- 1 piotrek piotrek  39095 2011-04-21 10:27 X10.txt.fpaq0f
    -rw-r--r-- 1 piotrek piotrek  38762 2011-04-21 10:27 X10.txt.fpaq0f2
    -rw-r--r-- 1 piotrek piotrek  35444 2011-04-21 10:22 X10.txt.gz
    -rw-r--r-- 1 piotrek piotrek  29334 2011-04-21 10:22 X10.txt.lzma
    -rw-r--r-- 1 piotrek piotrek  27952 2011-04-21 10:26 X10.txt.paq8l
    It's interesting that LZMA is so close in performance to PAQ8 or PPMonstr. Is there anything special in LZMA's literal coder?

  9. #9
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    In that case, I really recommend observing the common match lengths/offsets via Matt's fv. LZMA usually outperforms the others on very redundant data. Also, for periodic data, position (i.e. ctx = pos & 3) is a good context in LZMA's context model.
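To make the positional-context point concrete, here is a toy adaptive order-0 model whose context is pos % period (my sketch of the general idea, not LZMA's actual literal coder, which mixes position bits with literal-context bits):

```python
import math

def periodic_cost_bits(data, period=4):
    """Estimate the coded size in bits of an adaptive order-0 model
    contexted on pos % period (period=4 corresponds to ctx = pos & 3),
    using Laplace smoothing over the 256 byte values."""
    counts = [dict() for _ in range(period)]
    bits = 0.0
    for pos, b in enumerate(data):
        ctx = counts[pos % period]
        total = sum(ctx.values()) + 256       # +1 virtual count per symbol
        p = (ctx.get(b, 0) + 1) / total       # adaptive probability of b
        bits += -math.log2(p)                 # ideal arithmetic-coding cost
        ctx[b] = ctx.get(b, 0) + 1
    return bits
```

On data with period 4, the positional context makes each sub-stream nearly constant, so the estimated cost drops well below the plain order-0 cost (period=1).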
    BIT Archiver homepage: www.osmanturan.com

  10. #10
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    863
    Thanks
    459
    Thanked 257 Times in 105 Posts
    Is there any information on what each character means?
    Quantization suggests a numeric result. Numbers carry an order, a nearest neighbor, and so on; letters do not provide such information.
    Do you know if these letters are just an offset from the initial numeric values?

  11. #11
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,474
    Thanks
    26
    Thanked 121 Times in 95 Posts
    I think (i.e. I'm pretty sure) that the quantized values were converted to letters by just adding the ASCII code of the letter 'a'. Usual compressors, besides delta coders, don't care about alphabet permutation, so compressing e.g. random data in the range 'a'..'t' should achieve the same results as compressing random data in the range 'b'..'u'.
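If that guess is right (it is not confirmed), recovering the numeric values is a one-line mapping:

```python
def letters_to_values(data: bytes) -> list[int]:
    """Undo the presumed encoding: value v was stored as chr(ord('a') + v),
    so 'a'..'s' maps back to 0..18."""
    return [b - ord('a') for b in data]

def values_to_letters(values) -> bytes:
    """Inverse mapping, back to the letter form."""
    return bytes(ord('a') + v for v in values)
```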

    The fv output for the second sample is IMO very similar to the first case, except that the matches are longer, mainly a consequence of the smaller alphabet.

  12. #12
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,334
    Thanks
    209
    Thanked 1,008 Times in 533 Posts
    > It's interesting that LZMA is so close in performance to PAQ8 or PPMonstr.

    Code:
    4x  29292 // lzma e X10.txt 4x -d25 -fb270 -mc99999999 -lc0 -lp0 -pb2 -mt1 -mfbt4
    4xr 28510 // rec.exe c 4x 4xr (lzmarec v4b)
    4z  40964 // lzma e Z10.txt 4z -d25 -fb273 -mc99999999 -lc0 -lp0 -pb2 -mt1 -mfbt4
    4zr 40154 // rec.exe c 4z 4zr
    
    5x  29493 // deppm Ir1 /o32 (ppmonstr for which we have source, ~57k though)
    5z  41330 // deppm Ir1 /o32
    
    6x  30855 // kzip
    6z  43323 // kzip
    
    1x  30933 // green v3b (~6k source)
    1z  42948 // green v3b
    
    8x  28976 // ash 04a /s3 (~35k source)
    8z  40726 // ash 04a /s3
    > Is there anything special in LZMA's literal coder?

    Likely not, but it's still a good, precise arithmetic coder, and switching it to order-1
    (via -lc) actually produces worse results.
    Also, with -a0 (no parsing optimization) it produces 33k+ for X10.txt.

    Btw, my lzma decoder source is 6999 bytes, around 5k w/o spaces.

    I can also suggest "green" -
    http://encode.su/threads/541-Simple-...ll=1#post10910

