
Thread: Sample evaluation

  1. #1
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,348
    Thanks
    212
    Thanked 1,011 Times in 536 Posts

    Sample evaluation

    Here's a sample of data I have to compress.
    http://shelwien.googlepages.com/2.bmp
I'd like an estimate of its achievable compression ratio,
and maybe some ideas on modelling it.
    Can anyone help?

  2. #2
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    It's noisy data.

  3. #3
    Member
    Join Date
    Jun 2008
    Location
Lévis, Canada
    Posts
    30
    Thanks
    0
    Thanked 0 Times in 0 Posts
    See JBIG. Read Pennebaker & Mitchell.

  4. #4
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    563
    Thanks
    213
    Thanked 200 Times in 93 Posts
    The right side (the 250-300 rightmost pixels) seems to be a repetition of the left side most of the time, so you could try to break the scanlines there and fill the remaining part of the new scanline with white (or black?). This would create a picture of ~700*(9133*2) pixels and much white (black?) space (every second line would be ~70% white), but perhaps it'd help delta filters.

EDIT: An even better way to use these similarities would be to XOR the right side with the left side, so that only the differences remain. I think I'll do some experiments; so far I've only tested the original sample with some usual algorithms. Results for this (a sketch of the XOR transform follows these results):

    Original BMP: 1132554 bytes
    TIF CCITT4: 1762568 bytes
    TIF CCITT3: 1558844 bytes
    GIF: 1168870 bytes

    PNG (no PNGOUT): 1010388 bytes
    CCM 4: 978506 bytes
    paq8p -4: 911886 bytes
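A minimal sketch of the XOR transform described above, assuming the 1-bit image has been unpacked to one byte per pixel; the width (968) and split column (704) come from a later post and are only illustrative here:

Code:
#include <cstddef>
#include <vector>

// For each scanline, replace the right part with its XOR against the
// corresponding columns of the left part, so only differences remain.
// The transform is its own inverse, since the left part is unchanged.
void XorRightWithLeft(std::vector<unsigned char>& pixels,
                      std::size_t width, std::size_t height,
                      std::size_t split)
{
    for (std::size_t y = 0; y < height; y++)
    {
        unsigned char* row = &pixels[y * width];
        for (std::size_t x = split; x < width; x++)
            row[x] ^= row[x - split];
    }
}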
    Last edited by schnaader; 21st January 2009 at 20:27.
    http://schnaader.info
    Damn kids. They're all alike.

  5. #5
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    What are these? DCT coefficients?
    BIT Archiver homepage: www.osmanturan.com

  6. #6
    Member
    Join Date
    May 2008
    Location
    England
    Posts
    325
    Thanks
    18
    Thanked 6 Times in 5 Posts
    Looks very much like some sort of audio spectrogram?

  7. #7
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,348
    Thanks
    212
    Thanked 1,011 Times in 536 Posts
    @schnaader:
    Thanks, others are too lazy I guess.

    Yeah, this is a table of zero flags for the stereo audio spectral data
    (the audio sample is a concatenation of fragments of several songs).
    And somehow nearly 50% of all output is generated from these flags.

    > The right side (the 250-300 rightmost pixels)

It's 704+264.

> An even better way to use these similarities would be to
    > XOR the right side with the left side so that only the
    > differences remain.

    I tried that and got 13k worse compression with my model.

    > paq8p -4: 911886 bytes

    Wonder which one was that... My results with paq:

    Code:
              paq8p  paq8p1             paq8p  paq8p1
    2.bmp -1  956832 906648   3.bmp -1  961005 922084
    2.bmp -2  956842 906667   3.bmp -2  961028 922086
    2.bmp -3  956849 906682   3.bmp -3  961105 922111
    2.bmp -4  941561 906693   3.bmp -4  920934 922153
    2.bmp -5  941471 906706   3.bmp -5  920949 922167
    2.bmp -6  941478 906707   3.bmp -6  920940 922199
    2.bmp -7  941623 906711   3.bmp -7  921056 922220
And at last, thanks to you, I finally downloaded paq8p1
(which is not linked on Matt's site for some reason), and
also noticed that paq's compression is not always best with -7.
    Btw, 3.bmp is the image with interleaved channels.

In fact, I actually have a model for these flags; it compresses
that sample into 889537 bytes (though only along with other info).
    Here it is, btw:
    http://shelwien.googlepages.com/sh_model_z.txt
    (Just to show off my optimizer which quantized these contexts).

And my problem is that there's a clearly visible structure,
but the compression ratio is only 889537/1132554 = ~80%.
So I'd appreciate some ideas about additional contexts
(I know that using SSE2 for mixing and more submodels would
help, but there are speed considerations).

  8. #8
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
You could calculate an autocorrelation function to detect high correlations, which outputs a sequence (p_i, c_i), where p_i is the position (lag) and c_i the correlation value. Afterwards you sort that list by correlation and build a context from the bits at the N highest correlated positions:

ctx = bit(p_0) | bit(p_1)<<1 | bit(p_2)<<2 ...

This way you do some kind of nonlinear context clustering, which adapts to the patterns you see in there. The calculations shouldn't be too expensive; maybe you can make the correlation calculation adaptive, and hey - it's just 0/1. This can be interpreted as some kind of secondary model, since you predict based on the "context history" of the adaptive quantisation. Maybe you can even improve the clustering by applying RLE or some other quantisation than the one mentioned above (maybe directly by the probability of the context history instead of its pattern).
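A rough sketch of this idea (assuming the flags are available as one byte per pixel in scanline order; the correlation here is just the raw match rate between the sequence and its lagged copy, and all names and parameters are illustrative):

Code:
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Return the n lags (1..maxLag, maxLag < bits.size()) whose bits agree
// most often with the current bit - a crude autocorrelation function.
std::vector<std::size_t> TopCorrelatedLags(const std::vector<unsigned char>& bits,
                                           std::size_t maxLag, std::size_t n)
{
    std::vector<std::pair<double, std::size_t> > corr;
    for (std::size_t lag = 1; lag <= maxLag; lag++)
    {
        std::size_t match = 0;
        for (std::size_t i = lag; i < bits.size(); i++)
            match += (bits[i] == bits[i - lag]);
        corr.push_back(std::make_pair(double(match) / (bits.size() - lag), lag));
    }
    std::sort(corr.rbegin(), corr.rend()); // highest correlation first
    std::vector<std::size_t> lags;
    for (std::size_t i = 0; i < n && i < corr.size(); i++)
        lags.push_back(corr[i].second);
    return lags;
}

// ctx = bit(p_0) | bit(p_1)<<1 | ... over the selected lags.
unsigned int Context(const std::vector<unsigned char>& bits, std::size_t pos,
                     const std::vector<std::size_t>& lags)
{
    unsigned int ctx = 0;
    for (std::size_t i = 0; i < lags.size(); i++)
        if (pos >= lags[i])
            ctx |= (unsigned int)(bits[pos - lags[i]] & 1) << i;
    return ctx;
}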

  9. #9
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    563
    Thanks
    213
    Thanked 200 Times in 93 Posts
    Quote Originally Posted by Shelwien View Post
> An even better way to use these similarities would be to
    > XOR the right side with the left side so that only the
    > differences remain.

    I tried that and got 13k worse compression with my model.
    Damn.. I really thought that would help...

    Quote Originally Posted by Shelwien View Post
    > paq8p -4: 911886 bytes

    Wonder which one was that...
    Sorry, I forgot to mention that for paq8p, I converted the BMP to 8-bit before compressing so it could use its bitmap model.
Didn't know that paq8p1 has a 1-bit BMP model; this really helps (faster and better compression).
    For completeness, paq8p1 -4 result for 8-bit BMP is 911998, so using the 1-bit version is better here.

    Quote Originally Posted by Shelwien View Post
And at last, thanks to you, I finally downloaded paq8p1
(which is not linked on Matt's site for some reason), and
also noticed that paq's compression is not always best with -7.
    Btw, 3.bmp is the image with interleaved channels.
The results are interesting, especially that interleaving helps for paq8p, but not for paq8p1. Guess that's the bitmap model again. By the way, if it's some kind of audio data, perhaps you could convert it to WAV somehow and have a look at how paq8p's WAV model performs.

    Quote Originally Posted by Shelwien View Post
In fact, I actually have a model for these flags; it compresses
that sample into 889537 bytes (though only along with other info).
[...]
And my problem is that there's a clearly visible structure,
but the compression ratio is only 889537/1132554 = ~80%.
I guess the structure isn't that useful, as there is too much noise in it; at least, I'd expect that a good model (60-70% ratio, perhaps) would have to be carefully tuned to this special data.
    http://schnaader.info
    Damn kids. They're all alike.

  10. #10
    Member chornobyl's Avatar
    Join Date
    May 2008
    Location
    ua/kiev
    Posts
    153
    Thanks
    0
    Thanked 0 Times in 0 Posts
Files packed to .zip with lzma/deflate/store:
    Code:
method	original	packed	file
lzma	1105099   978932  1-.wbmp
    lzma	1106567   979657  1-.tif
    lzma	1114354   982387  1-.pcx
    lzma	1132554   983368  1-.bmp
    store	 995977   995977  1_out_deflopt.png
    store	 995991   995991  1_out.png
    store	 996006   996006  1_out_deflopt.png
    store	 996060   996060  1_out.png
    lzma	1090206   997557  1_packbits.tif
    lzma	4457032   998064  4-.pcx
    store	1000065  1000065  1_9nofilter_deflopt.png
    deflate	1008908  1008608  1_zip-iv.tif
    deflate	1008995  1008686  1_zip.tif
    store	1010391  1010391  1_9nofilter.png
    lzma	1475044  1052926  1_rle-iv.pcx
    lzma	1475044  1052933  1_rle.pcx
    lzma	1575431  1061441  4_rle.pcx
    lzma	4420490  1068699  4-.bmp
    lzma	8841641  1068730  8--.pcx
    lzma	8841822  1068813  8-.bmp
    lzma	8841530  1069486  8-.tga
    lzma	8841641  1069676  8-.pcx
    lzma   26522286  1148580  24-.bmp
    lzma   26522250  1148754  24-.tga
    lzma	4885832  1162055  8_rle.pcx
    lzma	1295674  1168122  1_Lzw.tif
    store	1168870  1168870  1_iv.gif
    store	1168880  1168880  1.gif
    lzma	6193418  1174452  8_rle.bmp
    lzma	5540966  1177862  4_rle.bmp
    lzma   26522360  1179275  24-.pcx
    lzma	5510979  1188858  8_rle+.pcx
    lzma	5735476  1228236  4_rle.tga
    lzma	5735755  1228774  8_rle.tga
    lzma	1550196  1249548  1_huffmanrle.tif
    lzma	1558844  1254199  1_g3.tif
    lzma   12067812  1300736  24_rle.tga
    lzma   16520960  1311403  24_rle.pcx
    deflate	1372536  1364252  8_Lzw.tif
    deflate	1465843  1465153  1_losless.jp2
    lzma	1762568  1552510  1_g4.tif
    Last edited by chornobyl; 22nd January 2009 at 14:42.

  11. #11
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,348
    Thanks
    212
    Thanked 1,011 Times in 536 Posts
    More results (with 2.bmp):

    907720 // BMFc.exe from http://compression.ru/ds/bmf2_ohz.rar
    977520 // BMF 1.10 -S (http://compression.ru/ds/bmf_1_10.rar)
    990620 // BMF 1.10
    988011 // bcdr-based CCC/pic coder (http://ctxmodel.net/files/bcdr_sh1.rar)
    936888 // RK 1.04.1 (on grayscale .tga)

Can somebody please test WinRK?

    Also, uploaded some alternate data representations:
    http://shelwien.googlepages.com/3.bmp (interleaved channels)
    http://shelwien.googlepages.com/zero.rar (original ascii log)

    @chornobyl:
    Is that a "worst compression" list?
    Thanks for pngout though, didn't test it yet.

  12. #12
    Member
    Join Date
    Oct 2007
    Location
    Germany, Hamburg
    Posts
    408
    Thanks
    0
    Thanked 5 Times in 5 Posts
I thought that m1 should perform really well on such a special file.

    Code:
    81   f = 979865.00
    151  f = 979487.00
    3565 f = 979436.00
    END
I will have a look to see if it gets any better.
    Last edited by Simon Berger; 22nd January 2009 at 20:32.

  13. #13
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    563
    Thanks
    213
    Thanked 200 Times in 93 Posts
I did some additional tests XORing the right part of the image (compressor: paq8p1 -1):

    Original: 906648
    XOR [1]: 976643
    NOT XOR [1]: 957462
    NOT XOR [2]: 921554
    NOT XOR [3]: 925963
    NOT XOR [4]: 924011
    NOT XOR [5]: 931461

    [1] columns 706-968, all lines
    [2] col. 706-968, all lines except 2280..3814, 4588..5350, 6100..9133
    [3] col. 706-968, all lines except 2280..3814, 4588..5350, 6100..7620
    [4] col. 706-968, all lines except 3047..3814, 4588..5350, 6100..9133
    [5] like [3], additional interval 7620..9133 using columns 616-968

So XOR isn't bad only for Shelwien's model (which actually seems to suffer less from it); inverting the result (NOT XOR) and restricting it to selected scanline intervals improves things. Anyway, compressing the unmodified picture is still better...
    http://schnaader.info
    Damn kids. They're all alike.

  14. #14
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,348
    Thanks
    212
    Thanked 1,011 Times in 536 Posts
    I didn't yet try to rotate the picture(s)...

  15. #15
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    563
    Thanks
    213
    Thanked 200 Times in 93 Posts
    Quote Originally Posted by Shelwien View Post
    I didn't yet try to rotate the picture(s)...
That was a good idea. paq8p1 -1 again:
Original: 906648
Mirrored vertically: 906332
Mirrored horizontally: 906266
Rotated 180°: 905847
Rotated 270° right: 893554
Rotated 90° right: 893349

By the way, the BMP files are 25100 bytes smaller too if rotated 90°/270°, because there's less padding (BMP rows are padded so each scanline's byte count is divisible by 4); this could be one reason why this improves so much.
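The padding arithmetic checks out; a quick sketch, using the 968x9133 dimensions implied earlier in the thread:

Code:
#include <cstdio>

// BMP scanlines are padded to a multiple of 4 bytes; for 1 bpp:
unsigned int Stride1bpp(unsigned int width)
{
    return ((width + 31) / 32) * 4; // bits -> dwords -> bytes
}

int main()
{
    unsigned int orig = Stride1bpp(968) * 9133; // 124 * 9133 = 1132492
    unsigned int rot  = Stride1bpp(9133) * 968; // 1144 * 968 = 1107392
    printf("%u\n", orig - rot);                 // 25100 bytes saved
    return 0;
}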
    Last edited by schnaader; 22nd January 2009 at 22:14.
    http://schnaader.info
    Damn kids. They're all alike.

  16. #16
    Member chornobyl's Avatar
    Join Date
    May 2008
    Location
    ua/kiev
    Posts
    153
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Code:
method	original	packed	file
ocamyd	1105462 920985 0_166-1-5-f.oca
    lzma	1105462 943586 0_bt4_2m.7z
    ppmdh	1105462 951424 4_o2-384m.7z
    deflate	1105462 950108 4_7z-u-p15-deflopt.zip
    rar	1105462 971344 0_m5-2m.rar

  17. #17
    Programmer michael maniscalco's Avatar
    Join Date
    Apr 2007
    Location
    Boston, Massachusetts, USA
    Posts
    113
    Thanks
    11
    Thanked 88 Times in 26 Posts
    Quote Originally Posted by Shelwien View Post
    Here's a sample of data I have to compress.
    http://shelwien.googlepages.com/2.bmp
I'd like an estimate of its achievable compression ratio,
and maybe some ideas on modelling it.
    Can anyone help?
Looks like data which might gain from reordering; someone suggested rotating the image, which is basically a reorder by the image width.

The following code is something I wrote a long time ago to dynamically determine the *best* reordering. It isn't always correct, but it most often is. It suggests the data you provided is best reordered by 124.

    To test this:
    M99 = 978,026
    M99 after reorder = 935,181

    So it looks like a good selection to me.

    - Michael Maniscalco



Code:
#include <cstddef>
#include <cstring>

// Find the record length (reorder step) that maximizes the standard
// deviation of the order-1 pair counts. High deviation indicates
// structure; extremely low deviation indicates random-looking data.
unsigned int FindReorder(unsigned char * data, size_t size, size_t maxReorder)
{
    unsigned int best = 1;
    double bestSd = 0;
    unsigned int count[0x10000];
    memset(count, 0, 0x10000 * sizeof(unsigned int));

    if (maxReorder >= size)   // guard against out-of-bounds reads below
        maxReorder = size - 1;

    for (unsigned int step = 1; step <= maxReorder; step++)
    {
        unsigned int n1 = 0;
        unsigned int n2 = step;
        unsigned int k = 0;

        // Count order-1 pairs (previous byte, current byte) along the
        // sequence reordered with stride 'step'.
        for (unsigned int i = 0; i < size; i++)
        {
            unsigned short symbol = data[n1];
            symbol = symbol << 8;
            symbol |= data[n2];
            count[symbol]++;
            n1 = n2;
            n2 += step;
            if (n2 >= size)
                n2 = ++k;
        }

        // Mean of the nonzero pair counts.
        double m = 0;
        int c = 0;
        for (unsigned int i = 0; i < 0x10000; i++)
            if (count[i])
            {
                m += count[i];
                c++;
            }
        m /= c;

        // Variance of the nonzero pair counts (no sqrt needed, since we
        // only compare); counts are also reset here for the next step.
        double sd = 0;
        for (unsigned int i = 0; i < 0x10000; i++)
            if (count[i])
            {
                sd += ((count[i] - m) * (count[i] - m));
                count[i] = 0;
            }
        sd /= c;

        if (sd > bestSd)
        {
            best = step;
            bestSd = sd;
        }
    }

    return best;
}

// Rewrite the data in "column" order for record length 'reorder':
// the output takes bytes 0, reorder, 2*reorder, ..., then 1, 1+reorder, ...
void Reorder(unsigned char * data, size_t size, unsigned int reorder)
{
    unsigned char * buffer = new unsigned char[size];
    unsigned int n = 0;
    unsigned int k = 0;
    for (unsigned int i = 0; i < size; i++)
    {
        buffer[i] = data[n];
        n += reorder;
        if (n >= size)
            n = ++k;
    }

    memcpy(data, buffer, size);
    delete [] buffer;
}
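A hypothetical driver for the pair (the maxReorder bound of 4096 is an arbitrary choice, not from Michael's post); the chosen step has to be stored in the compressed stream so the decoder can invert the transform:

Code:
// Assumes FindReorder/Reorder from above. Reorders a buffer in place
// before compression; returns the step so it can be stored for the decoder.
unsigned int ReorderForCompression(unsigned char * data, size_t size)
{
    unsigned int step = FindReorder(data, size, 4096); // hypothetical bound
    if (step > 1)
        Reorder(data, size, step);
    return step;
}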

  18. #18
    Member chornobyl's Avatar
    Join Date
    May 2008
    Location
    ua/kiev
    Posts
    153
    Thanks
    0
    Thanked 0 Times in 0 Posts
Quote Originally Posted by michael maniscalco View Post
The following code is something I wrote a long time ago to dynamically determine the *best* reordering.
Can you detect that reordering for enwik8?

  19. #19
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    That'll only be useful for images and maybe some other analog data.

  20. #20
    Programmer michael maniscalco's Avatar
    Join Date
    Apr 2007
    Location
    Boston, Massachusetts, USA
    Posts
    113
    Thanks
    11
    Thanked 88 Times in 26 Posts
    Quote Originally Posted by chornobyl View Post
Can you detect that reordering for enwik8?
Sure. Best is suggested to be 1.

As Toffer pointed out, it's really only useful for structured data that has records, and for some images. It's not very efficient either, which is why it isn't in any of my compression engines.

    - Michael Maniscalco

  21. #21
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by michael maniscalco View Post
Sure. Best is suggested to be 1.

As Toffer pointed out, it's really only useful for structured data that has records, and for some images. It's not very efficient either, which is why it isn't in any of my compression engines.

    - Michael Maniscalco
    Isn't it something like Bulat's Data Tables that are a part of Tornado?

  22. #22
    Programmer michael maniscalco's Avatar
    Join Date
    Apr 2007
    Location
    Boston, Massachusetts, USA
    Posts
    113
    Thanks
    11
    Thanked 88 Times in 26 Posts
    Quote Originally Posted by m^2 View Post
    Isn't it something like Bulat's Data Tables that are a part of Tornado?
    No idea. I'm not familiar with that at all.

The method I posted simply calculates the standard deviation of the order-1 symbol statistics for each reordering; the highest standard deviation wins. An extremely low deviation is indicative of random data.

It's usually correct, but very slow.

    - Michael

  23. #23
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by michael maniscalco View Post
    No idea. I'm not familiar with that at all.

The method I posted simply calculates the standard deviation of the order-1 symbol statistics for each reordering; the highest standard deviation wins. An extremely low deviation is indicative of random data.

It's usually correct, but very slow.

    - Michael
    OK, thanks for the answer.
After staring at both codes for a while... no, data tables are more like a delta modification; the statistics are also different.

  24. #24
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,348
    Thanks
    212
    Thanked 1,011 Times in 536 Posts
    I wonder why there're no suggestions like making a small jpeg (with high loss) out of it, and compressing a diff.
Also, there're some parts which are clearly noisy... Any advice on how to deal with noise in a context model?
Any model quality visualization ideas? Anything better than xoring the bits with (p>0.5) or (p>0.9)?
Btw, my recent result for that data is 849726.
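One possible sketch of such a visualization, going a bit beyond xoring with (p>0.5): encode each bit's code length -log2(p) as brightness, so confidently predicted bits come out dark and misses come out bright. The clamping and the scale factor are arbitrary assumptions:

Code:
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// bits[i] is the coded bit, probs[i] the model's probability of a 1.
// Output: one grayscale byte per pixel; write it as a PGM with the
// image's width/height to view where the model spends its bits.
std::vector<unsigned char> SurpriseMap(const std::vector<unsigned char>& bits,
                                       const std::vector<double>& probs)
{
    std::vector<unsigned char> out(bits.size());
    for (std::size_t i = 0; i < bits.size(); i++)
    {
        double p = bits[i] ? probs[i] : 1.0 - probs[i]; // prob. of the actual bit
        double codelen = -std::log(std::max(p, 1e-6)) / std::log(2.0);
        out[i] = (unsigned char)std::min(255.0, codelen * 32.0);
    }
    return out;
}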

Also, thanks for the detector source, but this kind of 3D data is not really compatible with BWT.
    So, any progress in 2D+ BWT?
    Like, what about sorting 2D shifts by square half-perimeter pixel sum?

  25. #25
    Member
    Join Date
    Sep 2007
    Location
    Denmark
    Posts
    873
    Thanks
    49
    Thanked 106 Times in 84 Posts
    Quote Originally Posted by Shelwien View Post
    I wonder why there're no suggestions like making a small jpeg (with high loss) out of it, and compressing a diff.

I tried something like that years ago: making a JPEG plus a diff file compressed with PNG (optimized), but it always ended up slightly bigger than just plain PNG.
