
Thread: Data compression explained

  1. #1
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 798 Times in 489 Posts

    Data compression explained

    My free online book on data compression.
    http://mattmahoney.net/dc/dce.html

    I may add some more stuff later. For now it covers most of the basic algorithms like LZ77, BWT, PPM, CM, Huffman, arithmetic coding, JPEG, MPEG, audio, etc, plus some theory. Most of the books on data compression are out of date, so someone needed to write this. Comments welcome.

  2. #2
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts

    Thumbs up

    I did a quick look, but it's only in the last few years that people have used 14 files for the Calgary Corpus. In the early years people compressed all 18 files, and everyone was trying to get the smallest total length. Now they only talk about 14 files and don't care about the total compressed length.
    Instead they see how many bits per byte each of the selected files compresses to, and then take the average of those 14 values. So to me it's very confusing. I know on this forum it's the 14 files. However, when I compare BWTS with Mark's, I am going to stick to the 18.
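    The two scoring conventions contrasted above can be made concrete with a small sketch (the file names and sizes here are invented placeholders, not real Calgary Corpus results):

    ```python
    # Two ways to score a corpus: total compressed size vs. average bits per byte.
    # Sizes below are made-up placeholders, not real Calgary Corpus numbers.
    files = {
        "fileA": (1000, 250),    # (original bytes, compressed bytes)
        "fileB": (2000, 1000),
    }

    # Convention 1: total compressed length across all files.
    total_compressed = sum(c for _, c in files.values())

    # Convention 2: bits per byte for each file, then the unweighted average.
    bpb = [8.0 * c / o for o, c in files.values()]
    average_bpb = sum(bpb) / len(bpb)

    print(total_compressed)  # 1250
    print(average_bpb)       # 3.0
    ```

    The average-bpb convention weights every file equally regardless of its size, while total length favors whoever does best on the largest files, which is why rankings under the two conventions can disagree.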

    But overall, good job!
    David Scott

  3. #3
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,610
    Thanks
    30
    Thanked 65 Times in 47 Posts
    5.3.2.5. Results

    There's something broken with the table.
    In particular, there are 6 columns but only 5 headers.

    5.7.2

    If precomp can't recompress some stream perfectly, but has something close, it generates patches.

  4. #4
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    I've just had a quick look at your text and really appreciate it. I'll just drop a few quick thoughts, since I'm actually at work. Maybe you already have some of this.


    1 mixing via 2d SSE, which is used in some bitwise CM compressors

    1a with quantised probabilities P(y=1|Q(p0),Q(p1)) (CCM)
    http://groups.google.com/group/encod...d17dbd63707f41

    1b with quantised bit histories P(y=1|Q(s0),Q(s1)) (M1)


    2 linear mixing in probability space, i.e. w0*p0 + w1*p1 + ...

    2a two-input mixers (1-w)*p0 + w*p1; there exist several update rules, e.g. counter backpropagation, numeric iterative optimization, etc., all with or without some approximations. You may want to have a look within IRC.
    There actually exist a lot of implementations.

    2b N-ary inputs. I once used it in conjunction with a simple gradient update.
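    As a rough sketch of case 2a, here is a two-input linear mixer (1-w)*p0 + w*p1 trained by gradient descent on the coding cost of each observed bit. The learning rate and clamping bounds are arbitrary illustrative choices, not taken from any particular coder:

    ```python
    class TwoInputMixer:
        """Linear mix p = (1-w)*p0 + w*p1, with w updated by gradient
        descent on the coding cost -log(p) of the observed bit."""
        def __init__(self, lr=0.02):
            self.w = 0.5
            self.lr = lr

        def mix(self, p0, p1):
            return (1.0 - self.w) * p0 + self.w * p1

        def update(self, p0, p1, bit):
            p = self.mix(p0, p1)
            # d(-log p)/dw is -(p1-p0)/p when bit=1, and (p1-p0)/(1-p) when bit=0
            grad = -(p1 - p0) / p if bit else (p1 - p0) / (1.0 - p)
            self.w = min(0.999, max(0.001, self.w - self.lr * grad))

    # Model 0 is well calibrated, model 1 is uninformative; w drifts toward 0,
    # shifting almost all the weight onto model 0.
    m = TwoInputMixer()
    for b in [1, 0] * 100:
        p0 = 0.9 if b else 0.1   # accurate model
        p1 = 0.5                 # no-information model
        m.update(p0, p1, b)
    print(m.w)  # close to the lower clamp of 0.001
    ```

    A real coder would mix many models, often selecting the weight set by context, but the update rule is the same idea.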


    3 Non-flat alphabet decomposition (order-0 Huffman is used in some CTW, order-1 in M1)


    4 CTW seems to be missing


    5 When talking about parameter tuning you may want to mention the direct application of stochastic search algorithms like GA, which is superior to "ad-hoc" tuning (BEE, Shelwien's coders, M1).


    6 I have some serious refinements to match models (e.g. suffix contexts), which I can publish after making further experiments/implementations.

  5. #5
    Programmer Jan Ondrus's Avatar
    Join Date
    Sep 2008
    Location
    Rychnov nad Kněžnou, Czech Republic
    Posts
    279
    Thanks
    33
    Thanked 138 Times in 50 Posts
    Preformatted text sections (enclosed in <pre> HTML tags) are not displayed correctly in my Firefox.
    You may remove the "<font face=arial,sans-serif>" tag, or close all "<p>" elements, and it will display correctly.
    Last edited by Jan Ondrus; 26th February 2010 at 11:26.

  6. #6
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 798 Times in 489 Posts
    Thanks for the comments. I fixed the formatting in Firefox. I guess I should not be so lazy and leave out closing </p> tags. It had looked OK in IE and Chrome.
    I also added some updates based on everyone's comments and added an acknowledgement section. I still plan to add sections for more algorithms like CTW, byte pair encoding, numeric codes (unary, rice, mu-law, floating point, etc), JPEG compression, maybe something about noisy channels, archive formats, etc.

  7. #7
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts

    fantastic!

    It is absolutely fantastic!

    It of course comes at just the right time and at just the right level for my exploration of compression.

    It's a lot easier to read if you've tried to pick up the principles first; if I had come across this book first, I think I'd have needed some diagrams of trees for PPM and such.

    And now I want SEE and SSE covered to the same depth as arithmetic encoding was! But that's just to suit my personal learning agenda.

    I think that HTTP on-the-fly compression with DEFLATE is very common, and often forgotten; presumably gazillions of DEFLATEd bytes go through the internet every day, probably outnumbering the transfer of archived bytes.
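    For a sense of how this kind of on-the-fly compression behaves, here is a minimal round-trip sketch using Python's standard zlib module (the payload is a made-up example of the repetitive markup servers typically deflate):

    ```python
    import zlib

    # A repetitive payload, like the HTML/CSS/JSON that web servers
    # compress on the fly before sending it over HTTP.
    body = b"<li>item</li>" * 500

    packed = zlib.compress(body, 9)          # zlib-wrapped DEFLATE stream
    assert zlib.decompress(packed) == body   # lossless round trip

    print(len(body), len(packed))  # repetitive data shrinks dramatically
    ```

    The compression is stateless per response and fast enough to run on every request, which is why it is so ubiquitous and so easy to forget about.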

    (Tiny correction: "I frame" is intra-frame, not inter-frame, in video)

    (H.264 intra-frames are typically half the size of the equivalent JPEG. They delta encode blocks within themselves, which is one of the things that helps achieve this feat.)

    I'm not convinced byte pair encoding needs much mention.
    Last edited by willvarfar; 27th February 2010 at 00:42.

  8. #8
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,135
    Thanks
    320
    Thanked 1,397 Times in 802 Posts
    Here's a compact version for printing: http://nishi.dreamhosters.com/u/dce2010-02-26.pdf
    It can also be used for precomp benchmarks.

  9. #9
    Member
    Join Date
    Aug 2008
    Location
    Saint Petersburg, Russia
    Posts
    216
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Just fabulous. Shame that I can only learn stuff from there rather than help improve it.

  10. #10
    Member
    Join Date
    May 2008
    Location
    England
    Posts
    325
    Thanks
    18
    Thanked 6 Times in 5 Posts
    I'll add my thanks for this; as someone who is fascinated with compression, I'll have to find the time to read it.

  11. #11
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    4,023
    Thanks
    415
    Thanked 416 Times in 158 Posts
    http://mattmahoney.net/dc/dce.html#Section_557

    Typo: "the engine for WinRK by Malcolm Tayler" - His name is Malcolm Taylor

    You may specify a BCM compressor home:
    http://encode.narod.ru

  12. #12
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 798 Times in 489 Posts
    Thanks.

  13. #13
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 798 Times in 489 Posts
    That would be logical, but how do you know?

  14. #14
    Member
    Join Date
    Mar 2009
    Location
    California
    Posts
    6
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Sorry for the very late post, I somehow missed this thread. I know the code. Nice write-up, BTW.

  15. #15
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 798 Times in 489 Posts
    OK thanks. I updated the section. http://mattmahoney.net/dc/dce.html#Section_6163 (computation of avg(u,v)).

  16. #16
    Member
    Join Date
    May 2012
    Location
    Finland
    Posts
    2
    Thanks
    0
    Thanked 0 Times in 0 Posts
    I see you linked to my errata in section 6.1.6.3, but it seems you missed a few of them in the description still.

    One is that this section:

    pU = U00 - 2.2076(U01 - S01)
    pL = L00 - 2.2076(L10 - S10)

    Should be this (subtraction changed to addition):

    pU = U00 - 2.2076(U01 + S01)
    pL = L00 - 2.2076(L10 + S10)
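    Treating the neighbor terms as opaque inputs (their exact meaning comes from the algorithm described in section 6.1.6.3, not shown here), the corrected predictors are just:

    ```python
    # Corrected predictors per the errata. U00, U01, S01, L00, L10, S10 are
    # intermediate values defined by the algorithm itself; they are treated
    # as opaque numeric inputs in this sketch.
    def predict_upper(U00, U01, S01):
        return U00 - 2.2076 * (U01 + S01)

    def predict_left(L00, L10, S10):
        return L00 - 2.2076 * (L10 + S10)

    print(predict_upper(100.0, 10.0, 5.0))  # = 100 - 2.2076*15, about 66.886
    ```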

    There might be some more, but I am not sure I want to hurt my brain any more by trying to remember that algorithm and all its errors. The errata should basically be correct (unless I made some typos of my own), as confirmed by the fact that I implemented an actual, working decompressor, which did not work without all of the changes listed in the errata (and possibly more, in case I forgot any when I wrote it down).

    (Also, as you point out, the entropy coder is pretty bad. For fun, I tried changing it to a much simpler one like the one used by LZMA and friends, and compression became noticeably better. I also fixed some of the other stupidities in the algorithm, like the entirely unnecessary slicing and separation of components.)

  17. #17
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 798 Times in 489 Posts

  18. #18
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,135
    Thanks
    320
    Thanked 1,397 Times in 802 Posts
    > and compression became noticeably better

    Is it better than packjpg now?

  19. #19
    Member
    Join Date
    May 2012
    Location
    Finland
    Posts
    2
    Thanks
    0
    Thanked 0 Times in 0 Posts
    I tried it on the image used in http://www.maximumcompression.com/data/jpg.php, but I didn't manage to beat PackJPG, although I did improve on the WinZip result.

    I have no idea what other kinds of tweaks might be worth trying, so I just left it at that. If anyone else wants to try playing with it and see if they can improve it, the code is here: http://code.google.com/p/wilt-compre...imental%2FJPEG. Consider it public domain; do whatever you want with it. It's lacking some critical features, like actually checking that the JPEG file is valid, which would need to be added to use it for anything real.

    The parent directory has a little Perl script named "optimize.pl" which will just try all the settings for the shift parameters to see which gets the best result. Run it as "perl optimize.pl JPEG/SimpleCompress file.jpg 9".

    It would be pretty neat if someone could turn this into an actually useful algorithm.

